“You Are a Machine Translation System” — How to Improve ChatGPT’s Translations

ChatGPT, in early 2023 the single hottest topic in the language industry (and in much of the rest of the world), was originally designed to interact with humans via fluent “conversation,” not to produce machine translation (MT).

Still, ChatGPT and other large language models’ machine translation abilities featured heavily at SlatorCon Remote March 2023 and sparked concern, at least among professionals, about translators’ and interpreters’ “exposure” to GPT-powered software.

A March 24, 2023 paper, “Towards Making the Most of ChatGPT for Machine Translation,” has now proposed several methods by which users can improve ChatGPT’s translation ability; namely, by adjusting the model’s “temperature” and by providing more detailed task and domain information in the form of prompts.

The roster of contributors was split between academia and industry, including Xuebo Liu and Min Zhang of Harbin Institute of Technology, Shenzhen and Beihang University’s Yuanxin Ouyang. Lead author Keqin Peng is affiliated with both Beihang University and JD Explore Academy, a project of China’s largest retailer, JD.com. Co-authors from JD Explore Academy were Liang Ding, Qihuang Zhong, Li Shen, and Dacheng Tao, the Academy’s director.

While ChatGPT’s MT abilities have impressed observers, the authors pointed out that the best results are for high-resource, closely related language pairs (a finding confirmed by research peers), and suggested that other researchers “usually adopt simple prompts which cannot fully elicit the capability of ChatGPT.”

This paper’s remedy: a combination of adjusting ChatGPT’s temperature and providing task- and domain-specific prompts.

Temperature is a sampling parameter that controls how varied a model’s output is. Higher temperatures yield greater linguistic variety, along with the potential for more errors. Lower temperatures, on the other hand, produce more predictable and consistent text, which can come at the cost of sounding less natural.
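
To make the effect concrete, here is a minimal Python sketch of how temperature reshapes the probabilities a model assigns to candidate next tokens. The three scores and the function name `softmax_with_temperature` are illustrative assumptions, not values or code from the paper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.2)   # sharp: top token dominates
high = softmax_with_temperature(logits, 1.5)  # flat: alternatives stay likely

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At a temperature of 0.2 nearly all of the probability lands on the top-scoring token, while at 1.5 it is spread across the alternatives, which is where the extra variety (and the extra risk of error) comes from.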

Linguistic variety has been key to ChatGPT’s success as a “chatting machine,” but, as the authors wrote, “the diversity of responses may hinder its performance on tasks with a high degree of certainty, such as machine translation, to some extent.”

Evaluating ChatGPT’s translation from English into Romanian, Chinese, and German, the researchers found that a lower temperature promoted higher-quality translations, particularly for “difficult” (i.e., lower-resource) languages.

The group designed task-specific prompts (TSP) to overcome ChatGPT’s limitations in MT, adopting, perhaps, a “fake-it-till-you-make-it” attitude for the model’s abilities.

Prompt to MT

“Specifically, we prepend the sentence ‘You are a machine translation system.’ to the best translation template […] and adopt it to query ChatGPT,” they wrote.
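
As a rough illustration of that step, the Python sketch below prepends the quoted sentence to a translation request. The template wording, the function name `build_tsp_prompt`, and the sample sentence are assumptions made for this example; the paper’s own “best translation template” is not reproduced here.

```python
# The task-specific sentence quoted from the paper.
TSP_SENTENCE = "You are a machine translation system."

# Hypothetical translation template, assumed for illustration only.
ASSUMED_TEMPLATE = "Translate the following sentence into {target}: {source}"


def build_tsp_prompt(source, target, template=ASSUMED_TEMPLATE):
    """Prepend the task-specific sentence to a filled-in translation template."""
    request = template.format(target=target, source=source)
    return f"{TSP_SENTENCE} {request}"


prompt = build_tsp_prompt("The weather is nice today.", "German")
print(prompt)
```

The resulting string is what would be sent to the model as the query; swapping in a different template only requires passing another `template` argument.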

Researchers incorporated domain-specific prompts (DSP) to identify for ChatGPT the domain of information related to the translated sentences — more specifically, the WMT19 Bio and News datasets.
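
A domain-specific prompt might be assembled along the following lines. This is only a sketch: the hint wording, the template, and the name `build_dsp_prompt` are hypothetical, and the domain labels simply mirror the two datasets mentioned above.

```python
# Hypothetical phrasing for the domain hint, assumed for illustration only.
ASSUMED_DOMAIN_HINT = "The sentence below comes from the {domain} domain."
ASSUMED_TEMPLATE = "Translate it into {target}: {source}"


def build_dsp_prompt(source, target, domain):
    """Tell the model which domain the sentence belongs to, then ask for a translation."""
    hint = ASSUMED_DOMAIN_HINT.format(domain=domain)
    request = ASSUMED_TEMPLATE.format(target=target, source=source)
    return f"{hint} {request}"


bio_prompt = build_dsp_prompt("The patient was given 5 mg daily.", "German", "biomedical")
news_prompt = build_dsp_prompt("The minister resigned on Monday.", "German", "news")
print(bio_prompt)
print(news_prompt)
```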

Results were mixed but, researchers seemed to believe, promising. TSP consistently improved MT quality semantically (as measured by COMET), but not lexically (BLEU and ChrF). DSP also consistently improved ChatGPT’s COMET score, but its impact on BLEU was inconsistent, and the authors acknowledged that ChatGPT “still lags significantly behind Google [Translate’s] performance.”

The team mentioned several other possible avenues for improving MT quality, which they may explore in future research. Few-Shot In-Context Learning was found to improve both BLEU and COMET scores, compared to a zero-shot approach, for English into Chinese, Romanian, and German.
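
For readers unfamiliar with the distinction, a few-shot prompt shows the model some worked translations before the sentence it should translate, whereas a zero-shot prompt contains only the request. The Python sketch below builds such a prompt; the layout, the function name `build_few_shot_prompt`, and the demonstration pairs are assumptions for illustration rather than the format used in the paper.

```python
def build_few_shot_prompt(examples, source, src_lang, tgt_lang):
    """Format demonstration pairs, then the new sentence with an empty target slot."""
    lines = []
    for example_source, example_target in examples:
        lines.append(f"{src_lang}: {example_source}")
        lines.append(f"{tgt_lang}: {example_target}")
    lines.append(f"{src_lang}: {source}")
    lines.append(f"{tgt_lang}:")  # left open for the model to complete
    return "\n".join(lines)


# Illustrative demonstration pairs, not taken from the paper's data.
demos = [
    ("Good morning.", "Guten Morgen."),
    ("Thank you very much.", "Vielen Dank."),
]

prompt = build_few_shot_prompt(demos, "See you tomorrow.", "English", "German")
print(prompt)
```

Passing an empty list of examples reduces this to a zero-shot prompt, which makes it easy to compare the two setups on the same sentences.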

Chain-of-Thought Prompting is effective in large language models but has yet to be studied in depth with regard to MT. These prompts tend to generate low-quality, word-by-word translations. The authors, inspired by the philosophy behind statistical MT, suggested modifying CoT to improve output.

The authors also offered a public service announcement on unintended, and undesirable, side effects of ChatGPT applied to certain language pairs.

“When tackling the non-English-centric tasks (both the input and expected output are non-English), ChatGPT may generate hallucinations, which should be paid more attention to by the MT/NLP community,” they wrote, adding, for emphasis, a siren emoji in the text of the paper.