Large language models, most notably ChatGPT and GPT-4, continue to be the hottest topic in the AI industry, and the language services industry is scrambling to understand what impact the latest generative AI technology will have.
Microsoft researcher Christian Federmann said in a SlatorPod episode on April 12, 2023, that he expects the fervor over generative AI to hit a wall within the next six months as people realize the real-world limitations of the technology (watch the podcast to get his full thoughts). For now, though, we are firmly entrenched in the hype part of the cycle.
And while there are growing concerns over whether large language models will lead to translation job losses, there is also a lot of interest in how these models perform on translation tasks.
Researchers have started to publish their initial investigations into the translation capabilities of ChatGPT, and this article examines four more papers, all published within the last few weeks, that explore ways to optimize ChatGPT for different translation tasks. Notably, three of the four were published by researchers at major Chinese tech firms.
Building Better Prompts
Researchers at Massey University in New Zealand looked at ways to “unleash the power of ChatGPT” for machine translation through designed prompts. They found that “ChatGPT with designed translation prompts can achieve comparable or better performance over professional translation systems for high-resource language translations,” but that it lagged significantly on low-resource translations. The designed prompts included additional information such as the translation direction and the type of content being translated.
The authors also looked at how other “auxiliary data,” such as part-of-speech tags, affected translation, with mixed results.
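To make the idea of a designed prompt concrete, here is a minimal sketch of how translation direction, content type, and optional part-of-speech tags might be folded into a single prompt. The function name, parameters, and wording are all hypothetical; the article does not reproduce the paper's actual prompts, so this should not be read as the authors' template.

```python
from typing import Optional, Sequence, Tuple


def build_translation_prompt(
    text: str,
    source_lang: str,
    target_lang: str,
    content_type: Optional[str] = None,
    pos_tags: Optional[Sequence[Tuple[str, str]]] = None,
) -> str:
    """Assemble a translation prompt with optional extra context.

    The wording is illustrative only and is an assumption, not the
    prompt used in the paper.
    """
    # Always state the translation direction explicitly.
    lines = [f"Translate the following text from {source_lang} to {target_lang}."]
    # Optionally tell the model what type of content it is translating.
    if content_type:
        lines.append(f"The text is {content_type} content.")
    # Optionally attach auxiliary data such as part-of-speech tags.
    if pos_tags:
        tagged = " ".join(f"{word}/{tag}" for word, tag in pos_tags)
        lines.append(f"Part-of-speech tags: {tagged}")
    lines.append(f"Text: {text}")
    lines.append("Translation:")
    return "\n".join(lines)


prompt = build_translation_prompt(
    "The patient was discharged yesterday.",
    source_lang="English",
    target_lang="German",
    content_type="medical",
)
print(prompt)
```

The resulting string would then be sent to the chat model as the user message; how it is sent depends on the API in use and is left out here.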
In another paper, researchers at Chinese tech firm Tencent’s AI Lab set out to develop a new framework for interacting with chat-based LLMs like ChatGPT that would yield better translation results. (The paper includes most of the same Tencent researchers who wrote an earlier paper on ChatGPT covered here.)
Their framework “reformulates translation data into the instruction-following style, and introduces a ‘Hint’ field for incorporating extra requirements to regulate the translation process,” which they say “improves the translation performance of vanilla LLMs significantly.”
Researchers at JD Explore Academy, part of Chinese e-commerce giant JD.com, published a paper that investigated ways to improve ChatGPT’s ability to evaluate translations. They introduce a new method of prompting LLMs, and specifically ChatGPT, for translation evaluation. They call the method Error Analysis Prompting and say it “can generate human-like MT evaluations at both the system and segment level.”
The report authors cited a recent paper from Microsoft researchers Federmann (mentioned above) and Tom Kocmi, which found that ChatGPT achieves state-of-the-art performance in assessing the quality of machine translation (MT) at the system level but performs poorly at the segment level. JD’s researchers say their method takes this research a step further, with promising results.
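As a rough illustration of what reformulating translation data into an instruction-following style with a hint could look like, consider the sketch below. The field names (`instruction`, `input`, `output`, `hint`) and the helper function are assumptions made for this example; the article does not specify the schema the Tencent researchers used.

```python
import json
from typing import Dict, Optional


def to_instruction_record(
    source_text: str,
    target_text: str,
    source_lang: str,
    target_lang: str,
    hint: Optional[str] = None,
) -> Dict[str, str]:
    """Turn one parallel sentence pair into an instruction-style record.

    Field names are hypothetical, chosen for illustration only.
    """
    record = {
        "instruction": f"Translate from {source_lang} to {target_lang}.",
        "input": source_text,
        "output": target_text,
    }
    # The optional hint carries an extra requirement for the translation.
    if hint:
        record["hint"] = hint
    return record


plain = to_instruction_record("Guten Morgen.", "Good morning.", "German", "English")
hinted = to_instruction_record(
    "Guten Morgen.",
    "Good morning.",
    "German",
    "English",
    hint="Keep the translation formal.",
)
print(json.dumps(hinted, indent=2))
```

Keeping the hint in a separate field, rather than baking it into the instruction text, means the same sentence pair can be reused with or without extra requirements.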
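The difference between system-level and segment-level evaluation can be shown with a toy example. Under the common convention that a system's score is the average of its segment scores (an assumption here, not a detail taken from either paper), an evaluator can rank whole systems sensibly even when its individual segment scores vary widely. All numbers below are invented for illustration.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical segment-level scores (0-100) that an evaluator assigned to
# two MT systems on the same three source segments.
segment_scores: Dict[str, List[float]] = {
    "system_a": [90.0, 40.0, 80.0],
    "system_b": [70.0, 75.0, 68.0],
}


def system_level_scores(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Average each system's segment scores into one system-level score."""
    return {name: mean(values) for name, values in scores.items()}


system_scores = system_level_scores(segment_scores)
# Rank systems from best to worst by their averaged score.
ranking = sorted(system_scores, key=lambda name: system_scores[name], reverse=True)
print(system_scores)
print(ranking)
```

Here system_b edges out system_a on average even though system_a has the single best segment, which is why an evaluator's accuracy can differ so much between the two levels.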
Despite the positive results, JD’s researchers also found that ChatGPT had some limitations as an MT evaluator, including giving different scores to the same translation and showing a preference for the earliest text in a query when multiple translations were provided.
This paper follows an earlier one from JD’s researchers, with some of the same authors, that looked at ways to improve ChatGPT’s translation outputs.
Finally, Tencent’s AI Lab researchers published another paper, this one investigating how LLMs perform at document-level machine translation and how they handle discourse phenomena such as entity consistency, referential expressions, and coherence. The researchers found that ChatGPT outperformed commercial machine translation systems (they tested against Google Translate, DeepL and Tencent’s own TranSmart service) in human evaluation of discourse awareness, though it underperformed on d-BLEU, a document-level variant of the BLEU metric.
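Both limitations can be probed with simple checks: score the same translation several times to see how much the score moves, and score a pair of translations in both orders to average out any position preference. The sketch below is not a method from the JD paper; the `Scorer` interface and the deliberately biased `toy_scorer` are hypothetical stand-ins for a real call to an LLM evaluator.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: given a source text and candidate translations,
# return one score per candidate, in the same order.
Scorer = Callable[[str, Sequence[str]], List[float]]


def score_spread(scorer: Scorer, source: str, translation: str, runs: int = 3) -> float:
    """Score the same translation several times; return max minus min."""
    scores = [scorer(source, [translation])[0] for _ in range(runs)]
    return max(scores) - min(scores)


def order_swapped_scores(
    scorer: Scorer, source: str, first: str, second: str
) -> Tuple[float, float]:
    """Score a pair in both orders and average each candidate's two scores."""
    a1, b1 = scorer(source, [first, second])
    b2, a2 = scorer(source, [second, first])
    return (a1 + a2) / 2, (b1 + b2) / 2


def toy_scorer(source: str, candidates: Sequence[str]) -> List[float]:
    """Stand-in evaluator that gives a 5-point bonus to the first candidate."""
    return [80.0 + (5.0 if position == 0 else 0.0) for position in range(len(candidates))]


spread = score_spread(toy_scorer, "Bonjour", "Hello")
averaged = order_swapped_scores(toy_scorer, "Bonjour", "Hello", "Good day")
print(spread)
print(averaged)
```

With the toy scorer, the position bonus cancels out once both orderings are averaged; a real evaluator would also show a non-zero spread if its scores vary between runs.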
“ChatGPT and GPT-4 have demonstrated superior performance and show potential to become a new and promising paradigm for document-level translation,” the authors said.