Are we on the cusp of yet another turning point in machine translation (MT), where large language models (LLMs) with the ability to perform all kinds of downstream tasks also excel at MT, rendering current specialized MT models obsolete?
Spoiler alert: Not yet. That is the conclusion of a November 16, 2022 research paper by Google, which demonstrates that LLMs have impressive MT capabilities but still lag behind state-of-the-art (SOTA) MT.
The researchers from Google carried out an in-depth investigation into the sentence-level translation capabilities of their Pathways Language Model (PaLM), experimenting with different prompting strategies and assessing the resulting performance.
The team selected PaLM because it had “demonstrated the strongest MT performance among similarly-trained LLMs to date.”
We find that, although impressive, the sentence-level translation capacity of LLMs still significantly lags that of competition-grade SOTA systems on recent WMT test sets.
— Markus Freitag (@markuseful) November 18, 2022
They looked into various strategies for choosing translation examples for few-shot prompting and concluded that "example quality is the most important factor," outweighing both the domain from which the examples were drawn and their "lexico-semantic proximity to the current input."
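To make the prompting setup concrete, here is a minimal sketch of few-shot MT prompting in Python. The prompt template, the example pool, and the language labels are illustrative assumptions, not the exact format or selection strategy used in the paper.

```python
# Minimal sketch of few-shot prompting for MT. The template below is an
# illustrative assumption; the paper experiments with several candidate
# pools and selection strategies for picking the example pairs.

from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],   # (source, target) translation pairs
    source_sentence: str,
    src_lang: str = "German",
    tgt_lang: str = "English",
) -> str:
    """Assemble a prompt from a handful of high-quality example translations."""
    lines = []
    for src, tgt in examples:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    # The sentence to translate goes last; the model is expected to continue
    # the pattern by producing the target-language line.
    lines.append(f"{src_lang}: {source_sentence}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


if __name__ == "__main__":
    pool = [
        ("Das Wetter ist heute schön.", "The weather is nice today."),
        ("Ich habe den Bericht gestern gelesen.", "I read the report yesterday."),
    ]
    prompt = build_few_shot_prompt(pool, "Der Zug kommt um acht Uhr an.")
    print(prompt)  # send this string to the LLM of your choice
```

The paper's central observation is that what goes into `examples` (their quality) matters more than where the pairs come from or how similar they are to the input sentence.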
The authors pointed out that theirs is “the first systematic study of LLM prompting for MT, exploring both the example candidate pool and the selection strategy.”
Creative, Fluent… Less Than Accurate MT
To evaluate translation performance, the authors followed current best practices for high-quality MT evaluation, namely
- using the latest WMT test sets;
- using the SOTA automatic metric BLEURT rather than BLEU, which was long ago found to be less than ideal for evaluating high-quality translations (a BLEURT scoring sketch follows this list);
- conducting an expert-based human evaluation.
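For readers who want to reproduce segment-level scoring, below is a minimal sketch using Google's open-source bleurt package. The checkpoint name and the example sentences are assumptions for illustration; the checkpoint itself must be downloaded separately from the google-research/bleurt repository.

```python
# Minimal BLEURT scoring sketch, assuming the google-research `bleurt` package
# is installed and a checkpoint (e.g. BLEURT-20) has been downloaded locally.
# The checkpoint path and sentences are assumptions for illustration.

from bleurt import score as bleurt_score

CHECKPOINT = "BLEURT-20"  # directory of the downloaded checkpoint

references = ["The train arrives at eight o'clock."]
candidates = ["The train gets in at eight."]  # MT output to evaluate

scorer = bleurt_score.BleurtScorer(CHECKPOINT)
scores = scorer.score(references=references, candidates=candidates)

# One score per segment; higher means the candidate is judged closer in
# meaning and fluency to the reference.
for cand, s in zip(candidates, scores):
    print(f"{s:.3f}\t{cand}")
```

Automatic scores like these were complemented in the study by the expert-based human evaluation noted above.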
The authors found that the specialized SOTA systems have a substantial advantage over PaLM, with the difference being narrower for the general-purpose Google Translate system.
However, the study revealed that PaLM holds up better when translating into English than out of it: the gap to the best MT system for each language pair narrows in the into-English direction. (The authors limited the study to French, German, and Chinese translation into and out of English.)
Generally, PaLM's translations were more creative, more fluent, and less literal than SOTA MT output across all languages. "This is one of the strengths of PaLM," wrote the authors.
However, PaLM’s MT output was more prone to omissions and other critical accuracy errors than SOTA MT. The authors noted that PaLM “occasionally misses some important information in the source or hallucinates facts not present in the source sentence.”
They added that, broadly speaking, “PaLM matches the fluency but lags [behind] the accuracy of conventional MT.”
On a side note, it is remarkable that cutting-edge (neural) machine translation is now being referred to as “conventional MT,” demonstrating just how far the technology has matured since Google’s 2016 NMT launch.
Limitations
Although the findings are important, the authors caution that their conclusions should not be generalized due to the small number of language pairs. The conclusions “pertain only to languages that are well represented in PaLM’s training corpus, and only to translation into and out of English,” they noted.
Furthermore, PaLM’s true capabilities may have been underestimated as a result of the restriction to independent sentence-level translations. In the context of whole-document translation, where less literal translations are more typical, some of the accuracy issues the authors discovered might be considered less severe.
“In future work, we look forward to testing PaLM on document-level translation tasks, unleashing its formidable ability to leverage long contexts,” the authors said.
They added that future research could involve “more sophisticated prompt tuning methods to see if these might offer a way to tighten up PaLM’s MT accuracy without destroying its impressive ability to generate highly-fluent text.”
Hat tip to Modelfront for alerting us to this research paper.