Fine-Tuning Boosts Real-Time Adaptive Machine Translation in Large Language Models

Fine-tuning LLMs for Adaptive Machine Translation

In early 2023, Yasmin Moslem, Rejwanul Haque, John D. Kelleher, and Andy Way from the ADAPT Centre explored adaptive machine translation (MT) with fuzzy matches using GPT-3 and found that, for certain language pairs, it outperformed other large language models (LLMs) such as BLOOM and BLOOMZ.

On December 20, 2023, Moslem, Haque, and Way went a step further, demonstrating that fine-tuning can improve the real-time adaptive MT capabilities of LLMs. Their primary objective was to enhance the real-time adaptive MT capabilities of a general-purpose LLM, Mistral 7B, enabling it to adapt its translations to a specific domain — in this case, the medical domain — at inference time.

As the authors explained, adaptive MT with LLMs employs in-context learning to customize translations, improving quality, domain adherence, and style. Although in-context learning can replicate text patterns without additional training, the authors argued that fine-tuning can further enhance an LLM's adaptability.

Moreover, they underscored that current MT research with LLMs primarily focuses on pre-training for zero-shot MT or fine-tuning to enhance zero-shot capabilities. While some works have explored pre-training or fine-tuning encoder-decoder MT models for adaptive MT, there is a need for research specifically on “fine-tuning available open-source models to enhance their in-context learning ability for real-time adaptive MT,” they said.

These models can be fine-tuned to perform better at in-context learning scenarios, where specific prompt templates include in-domain sentences, phrases, or terminology. “This direction can improve both translation quality and efficiency, especially as fewer examples might be required for in-context learning,” they highlighted.
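To make this concrete, the sketch below shows what such prompt templates might look like in practice. It is an illustrative assumption: the exact wording and layout of the prompts used in the paper may differ.

```python
# Minimal sketch of zero-shot vs. one-shot translation prompts for an
# instruction-following LLM. The template wording is illustrative only.

def zero_shot_prompt(source: str, src_lang: str = "Spanish", tgt_lang: str = "English") -> str:
    """Plain translation request with no in-domain context."""
    return (
        f"{src_lang}: {source}\n"
        f"{tgt_lang}:"
    )

def one_shot_prompt(source: str, fuzzy_src: str, fuzzy_tgt: str,
                    src_lang: str = "Spanish", tgt_lang: str = "English") -> str:
    """Prepend one fuzzy match (a similar source sentence and its approved
    translation) so the model can adapt terminology and style."""
    return (
        f"{src_lang}: {fuzzy_src}\n"
        f"{tgt_lang}: {fuzzy_tgt}\n"
        f"{src_lang}: {source}\n"
        f"{tgt_lang}:"
    )

if __name__ == "__main__":
    print(one_shot_prompt(
        source="El paciente presenta fiebre y tos persistente.",
        fuzzy_src="El paciente presenta dolor abdominal agudo.",
        fuzzy_tgt="The patient presents with acute abdominal pain.",
    ))
```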

For the fine-tuning process, the authors used 10,000 segments with zero-shot translation prompts and 10,000 segments with one-shot translation prompts. Zero-shot prompts represent regular translation requests without any context, while one-shot prompts add a fuzzy match to improve adherence to domain terminology and style. They focused on the Spanish-to-English language pair and evaluated the results with the BLEU, chrF++, TER, and COMET metrics.
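For readers who want to reproduce this kind of evaluation, here is a minimal scoring sketch using the sacrebleu and unbabel-comet Python packages; the specific COMET checkpoint shown is an assumption, not necessarily the one used in the paper.

```python
# Score MT output with BLEU, chrF++, TER (sacrebleu) and COMET (unbabel-comet).
import sacrebleu
from comet import download_model, load_from_checkpoint

sources = ["El paciente presenta fiebre y tos persistente."]
hypotheses = ["The patient presents with fever and a persistent cough."]
references = ["The patient has a fever and a persistent cough."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2)  # word_order=2 gives chrF++
ter = sacrebleu.corpus_ter(hypotheses, [references])
print(f"BLEU {bleu.score:.2f}  chrF++ {chrf.score:.2f}  TER {ter.score:.2f}")

# COMET is reference-based and also looks at the source sentence.
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
comet_data = [{"src": s, "mt": h, "ref": r}
              for s, h, r in zip(sources, hypotheses, references)]
print("COMET", comet_model.predict(comet_data, batch_size=8, gpus=0).system_score)
```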

Quality Gains and Efficient Self-Hosting

The experiments for the Spanish-to-English medical domain demonstrated that, with a relatively small dataset of 20,000 segments, fine-tuning significantly enhanced Mistral's in-context learning ability, especially for real-time adaptive MT.

In comparison with GPT-3.5-turbo and NLLB 3.3B, the fine-tuned Mistral 7B outperformed GPT-3.5-turbo in zero-shot translation while achieving comparable one-shot translation quality. Additionally, the zero-shot translation of the fine-tuned Mistral matched NLLB 3.3B’s performance, with its one-shot translation quality surpassing that of NLLB 3.3B.

“These findings emphasize the significance of fine-tuning efficient LLMs like Mistral 7B,” the authors said. They also highlighted that a fine-tuned small “standalone” LLM can be more efficient than using two models — conventional MT and LLM — at translation time. 

Furthermore, fine-tuning open-source LLMs offers the benefit of "efficient self-hosting", allowing users to deploy their own LLMs and preserve data privacy while achieving quality comparable to that of commercial models.
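As a rough illustration of what such self-hosting can look like, the sketch below loads a locally stored fine-tuned checkpoint with Hugging Face Transformers and runs a one-shot translation prompt; the model path is hypothetical.

```python
# Minimal self-hosting sketch with Hugging Face Transformers. The path
# "./mistral-7b-adaptive-mt" is a hypothetical local fine-tuned checkpoint,
# not an official release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "./mistral-7b-adaptive-mt"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# One-shot prompt: a fuzzy match followed by the sentence to translate.
prompt = (
    "Spanish: El paciente presenta dolor abdominal agudo.\n"
    "English: The patient presents with acute abdominal pain.\n"
    "Spanish: El paciente presenta fiebre y tos persistente.\n"
    "English:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```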

Finally, the authors expressed the intention to experiment with other domains and language pairs, including low-resource languages, and with other multilingual LLMs.

Note: The code used for these experiments is publicly available on GitHub.