How Google Wants to Improve Evaluation of Low-Resource Dialect Machine Translation

Google Machine Translation for Dialects

Dialects remain a tricky problem for machine translation, and Google is tackling the lack of region-awareness in MT systems. Many languages have regional varieties (or dialects) which, although mutually intelligible to their speakers, still exhibit lexical, syntactic, or orthographic differences and require different translations.

To date, most MT systems do not allow users to specify which variety of a language to translate into, which sometimes leads to confusion and unnatural translations. Moreover, region-unaware MT systems tend to favor “web-majority varieties” (i.e., varieties with more data available online), which “disproportionately affects speakers of under-resourced language varieties,” as Parker Riley, Timothy Dozat, Jan A. Botha, Xavier Garcia, Dan Garrette, Jason Riesa, Orhan Firat, and Noah Constant from Google explained in their February 2023 research paper. 

According to the researchers, a major barrier to further research on this topic has been the lack of a high-quality evaluation benchmark. To address this issue, they released a new dataset for measuring and evaluating MT systems’ ability to support regional varieties. The dataset, FRMT (Few-shot Region-aware Machine Translation), was released alongside the paper of the same name on February 16, 2023. It is intended to “encourage more access to language technologies for speakers of web-minority varieties and more equitable NLP research.”

Few-Shot Translation

In the paper, the Google researchers pointed out that previous works have explored region-aware MT. However, these works assume the availability of large-scale datasets containing examples with the target varieties explicitly labeled, which in many cases are unavailable or expensive to create.

In light of this data scarcity, they proposed FRMT as a benchmark for few-shot translation, measuring an MT model’s ability to translate into regional varieties when given at most 100 labeled examples of each language variety. MT models must learn from this small number of labeled examples to identify similar patterns in their unlabeled training data. This enables them to generalize and produce correct translations of phenomena not explicitly shown in the examples, the researchers explained. “Few-shot approaches to MT are attractive because they make it much easier to add support for additional regional varieties to an existing system,” they said.
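For an LLM, this few-shot setup typically amounts to prepending a handful of region-labeled translation pairs to the input before asking for a new translation. The sketch below illustrates the idea; the exemplar pairs and the prompt format are hypothetical, not the exact prompts used in the FRMT paper.

```python
# Illustrative sketch of few-shot prompting for region-aware translation.
# The exemplars and prompt format are assumptions for demonstration only.

def build_prompt(exemplars, source_sentence, region_label):
    """Assemble a few-shot prompt from region-labeled exemplar pairs."""
    lines = []
    for english, translation in exemplars:
        lines.append(f"English: {english}")
        lines.append(f"{region_label}: {translation}")
    # The unanswered final line cues the model to produce the translation.
    lines.append(f"English: {source_sentence}")
    lines.append(f"{region_label}:")
    return "\n".join(lines)

# Hypothetical exemplars targeting Brazilian Portuguese, where lexical
# choices differ from European Portuguese (e.g. "ônibus" vs. "autocarro"
# for "bus", "celular" vs. "telemóvel" for "cell phone").
exemplars_br = [
    ("The bus is late.", "O ônibus está atrasado."),
    ("I lost my cell phone.", "Perdi meu celular."),
]

prompt = build_prompt(exemplars_br, "Where is the bus stop?", "Portuguese (Brazil)")
print(prompt)
```

The hope, per the paper, is that a model shown "ônibus" in the exemplars will carry that region-specific lexical preference over to unseen sentences.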

Dataset Creation

The dataset covers two regions each for Portuguese (Brazil and Portugal) and Mandarin (Mainland and Taiwan), and was created by sampling English sentences from Wikipedia and acquiring professional human translations in the target regional varieties. Final quality verification was done through manual evaluation by an independent set of translators, using the Multidimensional Quality Metrics (MQM) framework.
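Under MQM, annotators mark individual error spans with a category and severity, and those annotations are aggregated into a weighted penalty score. The sketch below shows the general shape of such scoring; the categories and severity weights here are illustrative assumptions, not the exact scheme used in the FRMT evaluation.

```python
# Minimal sketch of MQM-style scoring. The severity weights below are
# assumed for illustration; real MQM deployments define their own schemes.

SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0}

def mqm_score(errors, num_words):
    """Compute a per-100-words error penalty from annotated errors.

    `errors` is a list of (category, severity) tuples produced by human
    annotators; lower scores indicate higher translation quality.
    """
    penalty = sum(SEVERITY_WEIGHTS[severity] for _, severity in errors)
    return 100.0 * penalty / num_words

# Example: two minor errors and one major error over a 50-word segment.
errors = [("fluency", "minor"), ("locale", "minor"), ("accuracy", "major")]
print(mqm_score(errors, 50))  # 14.0
```

Because MQM scores come from fine-grained span annotations rather than a single holistic rating, they can separate region-specific lexical errors from general translation errors, which is what makes the framework suitable for verifying a region-aware dataset.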

As the researchers explained, these languages and varieties were selected because they have many speakers who can benefit from increased regional support in MT and they are linguistically very distinct, coming from different families. The researchers hypothesized that “methods that perform well on both are more likely to generalize well to other languages.” They added, “In principle, those methods should also work for other language distinctions, such as formality and style.”

“With the release of the FRMT data and accompanying evaluation code, we hope to inspire and enable the research community to discover new ways of creating MT systems that are applicable to the large number of regional language varieties spoken worldwide,” the authors said.

PaLM Excels in Region-Aware MT

The evaluation covered a handful of recent models capable of few-shot control. Based on human evaluation with MQM, the baseline methods all showed some ability to localize their output for Portuguese. However, for Mandarin, they mostly failed to use knowledge of the targeted region to produce superior Mainland or Taiwan translations.

Comparing across models, Google’s language model, PaLM, consistently performed best across both Portuguese and Mandarin. “This performance is impressive when taking into consideration that PaLM was trained in an unsupervised way,” highlighted the authors.

The results suggest that large language models (LLMs) like PaLM “may be particularly adept at memorizing region-specific word choices required for fluent translation.” However, the researchers noted that “there is still a significant performance gap between PaLM and human performance.”

The paper concluded, “In the near future, we hope to see a world where language generation systems, especially MT, can support all speaker communities.” Moreover, the research team said, “We are excited to see how researchers utilize this benchmark in development of new MT models that better support under-represented language varieties and all speaker communities, leading to improved equitability in natural-language technologies.”