In short, the answer is that machine translation quality is now good enough for many use cases. Recent years have seen significant advances in machine translation (MT), the process by which software automatically translates content from one natural language (the source) into another (the target) without human input. Now, the race is on for providers to optimize quality.
In the 1950s, machine translation was proposed as one of the first non-numerical applications for computers, but calls to halt research and funding followed in the 1960s. With the advent of the World Wide Web in 1989, the public gained access to free translations of short texts.
Machine translation technology has seen several major advances since its inception. Rule-Based Machine Translation (RBMT) was followed by Statistical Machine Translation (SMT), and today the most prevalent approach is Neural Machine Translation (NMT).
Neural Machine Translation
Neural machine translation improves future translations based on past translations. The engine learns and improves continuously using artificial neural networks, which are loosely modeled on the networks of neurons in the human brain.
Google Translate launched in 2006, and by 2016 Google had built on earlier research to achieve breakthroughs with deep learning models for “neural” machine translation. Google Neural Machine Translation (GNMT) was the first production deployment of neural machine translation with the scale and speed to be useful to the public. Google even claimed that “in some cases human and GNMT translations are nearly indistinguishable.”
Germany’s DeepL took the world by storm when it appeared in 2017 and, within a few years, had earned a place alongside the big tech giants. Despite its secretiveness and limited engagement with academia, DeepL is one of the fastest-growing machine translation companies.
Neural machine translation research and funding now come from various branches of academia, big tech, and other institutions, including the EU, which announced USD 4.8m in neural machine translation grants for low-resource languages in 2019.
In 2020, Facebook’s M2M-100 became the first multilingual neural machine translation model that did not use English as a pivot language. It was followed in 2022 by Meta AI’s NLLB-200, an AI model able to translate between 200 languages across 40,000 translation directions.
In November 2022, Microsoft released NTREX-128, the second-largest human-translated dataset for machine translation evaluation to date, covering 128 target languages and 123 documents.
Machine Translation Quality
Machine translation output quality is not perfect. Among suggestions to improve output are human-paraphrased reference translations, pre-training machine learning models with synthetic data, and “human-in-the-loop” evaluation.
Automated evaluation metrics, such as the Bilingual Evaluation Understudy (BLEU), have been used for measuring machine translation output quality since 2002. However, they are flawed. Facebook patented its own alternative in 2019 and, in 2022, Meta proposed a new metric (XSTS).
Quality Estimation (QE), which is distinct from automatic evaluation metrics, predicts whether raw machine translation output is good or bad without the need for a human translator. In 2020, QE’s accuracy was called into question. A panel at SlatorCon Remote September 2022 discussed why QE is fundamental to machine translation.
Many remain skeptical about using machine translation in legal settings, healthcare scenarios, and literary work. Machine translation’s susceptibility to manipulation by attackers to produce specific, potentially harmful output is an ongoing concern.
Bias of all kinds in machine translation remains unresolved. In 2020, researchers from Cambridge described gender bias as a “domain adaptation problem,” while Google Translate claimed to have solved the issue. In December 2022, Amazon released MT-GenEval, a new gender evaluation benchmark for machine translation and one of the first based on real-world data and professionally created reference translations.
Large Language Models (LLMs)
State-of-the-art machine translation remains more advanced than LLMs. Research from Tencent supported this, showing that the quality of ChatGPT’s translations could not compete on low-resource or linguistically distant language pairs.
In a Nutshell
Considering how far the field has come in the last 70 years, machine translation is very good. What matters most is when and how you use it, depending on the content and scope of the project. Including a human reviewer or post-editor is worth considering, although a survey by Weglot found that two-thirds of its customers do not edit their machine translation output.