Why Large Language Models Hallucinate When Machine Translating ‘in the Wild’


Large language models (LLMs) have demonstrated impressive machine translation (MT) capabilities, but new research shows that, when deployed in real-world settings, the types of hallucinations they generate differ from those of traditional models.

The findings, published in a paper on March 28, 2023, included evidence that the hallucinations were more prevalent when translating into low-resource languages and out of English, and that they can introduce toxic text.

Hallucinations present a critical challenge in MT, as they may damage user trust and pose serious safety concerns, according to a 2022 research paper. Until now, however, studies on detecting and mitigating hallucinations in MT have been limited to small models trained on a single English-centric language pair.

This has left “a gap in our understanding of hallucinations […] across diverse translation scenarios,” explained Nuno M. Guerreiro and Duarte M. Alves from the University of Lisbon, Jonas Waldendorf, Barry Haddow, and Alexandra Birch from the University of Edinburgh, Pierre Colombo from the Université Paris-Saclay, and André F. T. Martins, Head of Research at Unbabel, in the newly published research paper.

Looking to fill that gap, the researchers conducted a comprehensive analysis of various massively multilingual translation models and LLMs, including ChatGPT. The study covered a broad spectrum of conditions, spanning over 100 translation directions across various resource levels and going beyond English-centric language pairs.

According to the authors, this research provides key insights into the prevalence, properties, and mitigation of hallucinations, “paving the way towards more responsible and reliable MT systems.”

Detach from the Source 

The authors found that hallucinations are more frequent when translating into low-resource languages and out of English, leading them to conclude that “models tend to detach more from the source text when translating out of English.”

In terms of type of hallucinations, oscillatory hallucinations — erroneous repetitions of words and phrases — are less prevalent in low-resource language pairs, while detached hallucinations — translations that bear minimal or no relation at all to the source — occur more frequently. 

According to the authors, “this reveals that models tend to rely less on the source context when translating to or from low-resource languages.”
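The paper's own detectors are not reproduced here, but the oscillation pattern described above lends itself to a simple heuristic: a translation whose most frequent n-gram repeats far more often than any n-gram in the source is likely oscillatory. A minimal sketch of that idea (function names and the threshold are illustrative assumptions, not the authors' method):

```python
from collections import Counter

def top_ngram_count(tokens, n=4):
    """Return how often the most frequent n-gram occurs in a token list."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0
    return Counter(ngrams).most_common(1)[0][1]

def is_oscillatory(source, translation, n=4, threshold=2):
    """Flag a translation whose top n-gram repeats notably more often
    than the source's top n-gram -- a crude sign of oscillation."""
    src_count = top_ngram_count(source.split(), n)
    hyp_count = top_ngram_count(translation.split(), n)
    return hyp_count - src_count >= threshold

# A degenerate, repetitive output is flagged; a normal one is not.
print(is_oscillatory("the cat sat on the mat",
                     "le chat le chat le chat le chat le chat le chat"))
```

Detached hallucinations, by contrast, cannot be caught by surface repetition checks and typically require comparing the semantics of source and translation.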

The rate of hallucinations exceeded 10% in some language pairs, such as English-Pashto, Tamil-English, Azerbaijani-English, English-Azerbaijani, Welsh-English, English-Welsh, and English-Asturian. However, the authors suggest that hallucination rates can be reduced by scaling up the model or by using smaller distilled models.

Hallucinations and Toxicity

The authors also found that hallucinations may contain toxic text, especially when translating out of English and into low-resource languages, and that scaling up the model size may not reduce these toxic hallucinations.

This indicates that hallucinations might be attributed to toxic patterns in the training data and underlines the need to filter the training data rigorously to ensure the safe and responsible use of these models in real-world applications.

The authors emphasize that while massive multilingual models have significantly improved the translation quality for low-resource languages, the latest findings underscore potential safety concerns and the need for improvement.

To mitigate hallucinations and improve overall translation quality, they explored fallback systems, finding that hallucinations can be “sticky and difficult to reverse when using models that share the same training data and architecture.” 

However, external tools, such as NLLB, can be leveraged as fallback systems to improve translation quality and eliminate pathologies such as oscillatory hallucinations.
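The fallback mechanism itself is straightforward: run the primary model, check the output with a hallucination detector, and only call the external model when the output is flagged. A hedged sketch of that control flow (the toy models and the crude repetition detector below are stand-ins, not the systems used in the paper):

```python
def translate_with_fallback(source, primary_translate, fallback_translate,
                            is_hallucination):
    """Translate with a primary model; if the output is flagged as a
    hallucination, retry with an external fallback model (e.g. NLLB)."""
    hypothesis = primary_translate(source)
    if is_hallucination(source, hypothesis):
        return fallback_translate(source)
    return hypothesis

# Toy stand-ins for real models and a real detector.
primary = lambda s: "the the the the"                    # degenerate output
fallback = lambda s: "a clean translation"
detector = lambda src, hyp: len(set(hyp.split())) == 1   # crude repetition check

print(translate_with_fallback("ein Satz", primary, fallback, detector))
```

The key design point reported by the authors is that the fallback model should not share training data and architecture with the primary model; otherwise the same hallucination tends to recur.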

ChatGPT Surprise

The authors also found that ChatGPT produces different hallucinations compared to traditional MT models. These errors may include off-target translations, overgeneration, or even failed attempts to translate. 

Furthermore, unlike traditional MT models, which frequently produce oscillatory hallucinations, ChatGPT does not generate any such hallucinations under perturbation. “This is further evidence that translation errors, even severely critical ones, obtained via prompting an LLM are different from those produced by traditional machine translation models,” explained the authors.

Moreover, the results revealed that ChatGPT generates more hallucinations for mid-resource languages than for low-resource languages, highlighting that “it surprisingly produces fewer hallucinations for low-resource languages than any other model.”

The authors note that the majority of these hallucinations can be reversed with further sampling from the model. This does not necessarily indicate a defect in the model’s ability to generate adequate translations, but may instead be a result of “bad luck” during generation, as Guerreiro, Martins, and Elena Voita, AI Research Scientist at Meta, wrote in a 2022 research paper.

To facilitate future research in this area, the authors have made their code openly available and released over a million translations and detection results across several models and language pairs.
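Released detection results like these can be aggregated into per-language-pair hallucination rates with a few lines of code. The record schema below is a hypothetical illustration; the actual released files may be structured differently:

```python
from collections import defaultdict

# Hypothetical per-sentence detection records (schema is an assumption).
records = [
    {"langpair": "en-ps", "hallucination": True},
    {"langpair": "en-ps", "hallucination": False},
    {"langpair": "en-cy", "hallucination": False},
    {"langpair": "en-cy", "hallucination": False},
]

def hallucination_rates(records):
    """Compute the fraction of flagged translations per language pair."""
    totals, flagged = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["langpair"]] += 1
        flagged[r["langpair"]] += r["hallucination"]
    return {lp: flagged[lp] / totals[lp] for lp in totals}

print(hallucination_rates(records))
```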