Guess Which Language Pair Emits the Most Carbon in Machine Translation

A new benchmark for machine translation (MT) has emerged and, for once, it has very little to do with translation quality. Researchers in India used the CodeCarbon package to, yes, benchmark carbon dioxide (CO2) emissions from training MT engines, measuring the environmental (un)friendliness of various language pairs.

A team of four from the Manipal Institute of Technology — Mirza Yusuf, Praatibh Surana, Gauri Gupta, and Krithika Ramesh — published the paper, “Curb Your Carbon Emissions: Benchmarking Carbon Emissions in Machine Translation,” on pre-print platform arXiv on September 26, 2021.

The authors felt it was “imperative” to explore carbon efficiency in MT even though, relatively speaking, MT is not a major climate offender and climate activists are unlikely to boycott MT any time soon. According to the paper’s authors, language models “require a large amount of computational power and data to train, consequently leading to large carbon footprints.”

One MT provider that beefed up its access to computational power is DeepL. The Germany-based company established a data center in Iceland to help source its computing power, claiming its supercomputer is among the largest in the world.

The large-scale training and development of MT (as well as NLP models more broadly) "could have detrimental consequences on the environment," whether or not their energy usage is carbon neutral. In short, the energy used to train MT engines is "possibly contributing directly or indirectly to the effects of climate change," the researchers said.

Carbon Emissions per Language Pair

The researchers' work involved evaluating six language pairs to assess the computational power required for training; that is, which pairs were more power-hungry and, hence, more carbon-emitting.

By assessing the differences in carbon emissions per language pair, the researchers hoped to open the door to a more environmentally friendly approach to MT training that takes into account how a specific language pair performs.

The experiments focused on English, German, and French and their six possible language combinations. The researchers compared performance across two models, a convolutional sequence-to-sequence learning model (ConvSeq) and a Transformer-based model with attention mechanisms, trained on a dataset containing around 30,000 samples for each language.

The researchers tracked the carbon emissions released during training using the CodeCarbon package as well as the improvement in BLEU scores for reference and comparison.

Environmentally (Un)friendly MT

Not only did language pairs with German as the target display the lowest BLEU scores, they also took the longest to reach a BLEU threshold score of 25. The researchers said this second finding supported the hypothesis that "translation to German might be more computationally involved than French or English."

In terms of training time required, the French>German, English>German, and German>French language pairs took the longest to train and were the most carbon-intensive pairs as a result. The French>German language pair was “the most computationally expensive” across both models.

By contrast, English>French, German>English, and French>English, which each involved English as a source or target language, took less time to train and were the least carbon-intensive.

Interestingly, the German dataset was the most lexically diverse of the three — based on vocabulary per number of tokens. This “likely demonstrates that lexical diversity is directly proportional to training time to achieve an adequate level of performance,” the researchers noted.
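The diversity measure described above, vocabulary size per number of tokens, is the type-token ratio. A minimal sketch, with illustrative sentences rather than the paper's data:

```python
# Type-token ratio: unique words (types) divided by total words (tokens).
# A higher ratio means more lexical diversity.
def type_token_ratio(text: str) -> float:
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

# Illustrative examples (not from the paper's dataset)
en = "the cat sat on the mat and the dog sat on the rug"
de = "die Katze sass auf der Matte und der Hund lag auf dem Teppich"

print(type_token_ratio(en))  # repeated function words lower the ratio
print(type_token_ratio(de))
```

On a real corpus this would be computed over the full training set; by this measure the German data scored highest of the three languages.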

When comparing the two systems, the Transformer models proved to be significantly less carbon-emitting than the ConvSeq models, which the researchers attributed to the fact that the former had comparatively fewer parameters. The Transformers also achieved higher BLEU scores.

The researchers concluded that a disparity exists between language pairs in terms of carbon emissions and “language pairs involving English demonstrate higher performance than ones that do not.” However, “much study remains to be done to identify what exactly it is that causes the differences in emissions,” they said.

Aside from proposing ways “to reduce carbon emissions released while training and deploying machine translation systems that are trained extensively over large datasets,” the researchers said future research could also be extended to low-resource languages and those that do not follow the Latin script.