Microsoft Says Large Language Models Show ‘Very Competitive Machine Translation Quality’

Whether GPT (Generative Pre-trained Transformer, one variant of what is now widely referred to as large language models) models are good at machine translation (MT) is arguably the language industry’s hottest topic in early 2023, amid the explosion of interest in AI following the release of ChatGPT. It is also a difficult question, and one that several groups of academic and industry researchers are trying to answer simultaneously.

At Microsoft, researchers Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla ran a series of experiments to answer this question, presenting their results in a paper published on February 18, 2023.

The researchers compared different GPT models (ChatGPT, text-davinci-002, and text-davinci-003, the latter also known as GPT-3.5) and evaluated the results against both research and commercial MT systems. Their tests covered machine translation in 18 high- and low-resource language pairs and involved human reviewers to assess quality.

As part of their experiments, the researchers also tested the premise that GPT models could enhance document-level translation. Citing extensive literature on both topics, they further examined domain transfer and the effectiveness of prompting (adapting language to create a precise instruction for an AI model).

Differences Between GPT and Commercial MT Models

In the paper’s Introduction, the researchers explained several key differences between GPT models and commercial MT models. GPT models are decoder-only (a single stack that generates the translation directly from a prompt containing the source text), whereas MT models generally use an encoder-decoder architecture (encoding the source and decoding the target).
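
The contrast is easiest to see at the interface level. The following sketch, assuming the Hugging Face transformers library and two small stand-in models (t5-small and gpt2, used purely for illustration and saying nothing about the quality of the systems in the paper), shows how an encoder-decoder MT model consumes the source directly, while a decoder-only model receives it embedded in a prompt and simply continues the text:

```python
# Minimal sketch of the two interfaces; model names are illustrative
# stand-ins, not the systems evaluated in the paper.
from transformers import pipeline

source = "The weather is lovely today."

# Encoder-decoder MT model: the source is encoded, the target decoded.
mt = pipeline("translation_en_to_de", model="t5-small")
print(mt(source)[0]["translation_text"])

# Decoder-only model: the source sits inside a prompt, and the model
# continues the text, producing the translation as a completion.
lm = pipeline("text-generation", model="gpt2")
prompt = f"Translate this sentence from English to German:\n{source}\nGerman:"
print(lm(prompt, max_new_tokens=40)[0]["generated_text"])
```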

Another key difference is that GPT models are trained mainly on monolingual datasets, with English far outweighing other languages, while MT models are trained on curated parallel data. GPT models also rely on far larger parameter counts to produce multilingual translations within their contextual constraints.

These differences shaped the experiments and raised another important question: can a combination of GPT and commercial MT models lead to higher-quality output? In other words, the researchers asked whether the two types of systems are complementary.

Experiments and Results

For this study, the researchers used publicly available datasets (WMT22 test sets for Chinese, Czech, English, French, German, Japanese, Russian, and Ukrainian; and WMT21 test sets for Hausa and Icelandic). Testing was done at both the sentence level and the document level.

Among other experiments, the researchers compared the overall zero-shot translation output (i.e., translation prompted without any example translations in the prompt) of the three GPT models on four language pairs and eight translation directions. The languages represented were German, Russian, Chinese, and French.
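
As an illustration, the sketch below queries the two completion models through OpenAI’s legacy completions API (the client style current in early 2023); ChatGPT is served through a separate chat interface and is omitted here. The prompt wording is an assumption, not the paper’s exact template:

```python
# Hedged sketch of zero-shot translation with the legacy OpenAI client
# (openai < 1.0): no example translations appear in the prompt.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = (
    "Translate this sentence from English to German:\n"
    "The weather is lovely today.\n"
    "German:"
)

for model in ["text-davinci-002", "text-davinci-003"]:
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=100,
        temperature=0,  # deterministic decoding, typical for MT evaluation
    )
    print(model, "->", response["choices"][0]["text"].strip())
```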

Experiments involving the text-davinci-003 model encompassed all 18 language pairs (high- and low-resource, as mentioned above) and four prompting strategies: zero-shot, and few-shot with randomly, quality-, or relevance-selected examples. Results were compared to output from Microsoft Translator and the WMT state-of-the-art (SoTA) systems.
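
What quality-selected few-shot prompting might look like in practice is sketched below; the example format, quality scores, and helper function are illustrative assumptions, not the paper’s implementation:

```python
# Hedged sketch: build a few-shot prompt from the k highest-quality
# example translations, a rough analogue of quality-selected shots.
def build_few_shot_prompt(examples, source, src_lang="English",
                          tgt_lang="German", k=3):
    """examples: list of (source_text, target_text, quality_score)."""
    best = sorted(examples, key=lambda ex: ex[2], reverse=True)[:k]
    blocks = [f"{src_lang}: {src}\n{tgt_lang}: {tgt}" for src, tgt, _ in best]
    blocks.append(f"{src_lang}: {source}\n{tgt_lang}:")
    header = f"Translate the following sentences from {src_lang} to {tgt_lang}.\n\n"
    return header + "\n\n".join(blocks)

examples = [
    ("Good morning.", "Guten Morgen.", 0.95),
    ("See you tomorrow.", "Bis morgen.", 0.90),
    ("How are you?", "Wie geht es dir?", 0.88),
]
print(build_few_shot_prompt(examples, "The weather is lovely today."))
```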

The evaluation phase included multiple automatic metrics (COMET-22, COMETkiwi, ChrF, and BLEU). For the human evaluation, the researchers used source-based, sentence-level contrastive Direct Assessment with Scalar Quality Metrics, carried out by professional annotators.
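
Two of these metrics, BLEU and ChrF, can be computed with the open-source sacrebleu library, as in the minimal sketch below (the COMET metrics come from the separate unbabel-comet package):

```python
# Scoring a system output against references with sacrebleu.
import sacrebleu

hypotheses = ["The weather is lovely today."]        # system output
references = [["The weather is beautiful today."]]   # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}  ChrF: {chrf.score:.2f}")
```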

In the zero-shot translation experiments, text-davinci-003 outperformed the other GPT models. ChatGPT performed well on the German-English language pair and showed results similar to text-davinci-003 when translating into English and between French and German. text-davinci-002 performed poorly across all language pairs compared to the other models.

In the 18-language-pair experiment, text-davinci-003 outperformed the WMT-Best and MS-Translator systems for the German-English, Japanese-English, and Chinese-English language pairs. Results for the other three language pairs were comparable to those of the other systems.

In the low-resource language experiments, conducted with Icelandic and Hausa, the model performed poorly compared to its results for high-resource languages. To the researchers, this signaled the need for different, better-suited approaches in future work. In the paper, they stated: “We should be cautious about drawing conclusions from low quality testsets or weaker baselines which are usually dominating the research results for low resource languages.”

For the human evaluation phase, the researchers used a randomly selected sample of 425 non-identical translation pairs for each language pair and had five professional translation experts per language pair annotate the results. The human evaluation results were consistent with those produced by the COMETkiwi metric.
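
As an illustration of that sampling setup, a minimal sketch: filter out segments where two systems produced identical outputs, then draw a random sample of 425 (the helper name and details are assumptions):

```python
# Hedged sketch of selecting non-identical translation pairs for human
# evaluation, mirroring the sampling described above.
import random

def sample_for_human_eval(system_a, system_b, n=425, seed=0):
    """Return up to n output pairs where the two systems disagree."""
    candidates = [(a, b) for a, b in zip(system_a, system_b)
                  if a.strip() != b.strip()]
    random.seed(seed)
    return random.sample(candidates, min(n, len(candidates)))
```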

The researchers concluded in the paper that, “Our results show that GPT models achieve very competitive translation quality for high resource languages, while having limited capabilities for low resource languages.” 

A number of the experiments compared results from GPT models with those of other systems, and an important takeaway was that combining these models with other translation systems does result in better translation quality across language pairs.
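
One way such a combination can work, in the spirit of the paper’s hybrid experiments, is per-segment selection with a reference-free quality-estimation model. The sketch below assumes the open-source unbabel-comet package and its COMETkiwi checkpoint; the selection logic is an illustrative sketch, not the paper’s exact method:

```python
# Hedged sketch: score two candidate translations without a reference
# and keep the higher-scoring one per segment.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-cometkiwi-da")
qe_model = load_from_checkpoint(model_path)

def pick_better(source, nmt_output, gpt_output):
    data = [
        {"src": source, "mt": nmt_output},
        {"src": source, "mt": gpt_output},
    ]
    scores = qe_model.predict(data, batch_size=2).scores
    return nmt_output if scores[0] >= scores[1] else gpt_output
```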