New research from Meta explores the little-understood phenomenon of interference, broadly defined as a negative interaction between different translation directions in a multilingual machine translation model.
Interference is measured for a specific translation direction by comparing the performance of a bilingual model, trained to translate from a single source language to a single target language, with that of a multilingual model trained on additional translation directions as well.
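One simple way to operationalize that comparison is as the relative change in a quality score (such as BLEU) between the two models; this is a sketch for illustration, and the paper's exact metric may differ:

```python
def interference(bilingual_score: float, multilingual_score: float) -> float:
    """Relative change in translation quality for one direction when
    moving from a dedicated bilingual model to a multilingual one.
    Negative values indicate interference; positive values, synergy.
    """
    return (multilingual_score - bilingual_score) / bilingual_score

# Hypothetical BLEU scores for a single translation direction:
print(interference(30.0, 27.0))  # -0.1 → a 10% drop: interference
print(interference(30.0, 31.5))  # 0.05 → a 5% gain: synergy
```

A score of zero would mean the multilingual model matches its bilingual counterpart exactly on that direction.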
“Interference trends can be tricky to measure,” lead author Uri Shaham acknowledged in a December 16, 2022 tweet, summing up the paper’s central questions: “What causes interference or synergy between language pairs in multilingual translation? Do we actually need specialized algorithms to alleviate interference?”
Shaham, a PhD candidate at Tel Aviv University and intern at Meta AI, worked with Meta AI colleagues Maha Elbayad, Vedanuj Goswami, Shruti Bhosale, and Omer Levy (also affiliated with Tel Aviv University) to identify the main contributing factors to interference in their paper, Causes and Cures for Interference in Multilingual Translation.
The team worked primarily in the English-to-many setting, where interference is more readily apparent. They systematically examined model size, the amount of source-to-target data, the proportion of source-to-target data relative to the rest of the training data, the number of language pairs, and language similarity.
Shaham dismissed two of those factors in a later tweet: “When there is a decent amount of data, language similarity and the number of languages pairs do *not* have a major effect.”
Instead, according to the paper, interference often occurs in cases where a data-rich language pair has to “share” crowded parameter space with large quantities of other data.
In other words, interference is sensitive to the proportion of “focus pair” source to target examples out of the total number of examples across all language pairs, at each step of training.
In particular, the authors observed substantial interference when a language pair’s dataset was very small relative to the total amount of training data.
A Surprisingly Simple Solution
Two methods were shown to reduce interference significantly: scaling up the model and tuning the so-called sampling “temperature,” a parameter that controls how often each language pair is sampled during training.
Scaling up from a standard baseline model of 176M parameters reduced interference, while further scaling led to synergy, a beneficial transfer between different language pairs.
In the practical setting where both model size and multilingual data are fixed, tuning the sampling temperature to control the proportion of each language pair in the training data emerged as the key to balancing interference.
“Temperature sampling is the simplest way of controlling the data size tradeoffs,” Shaham tweeted, going on to suggest that the problem of interference might not be as widespread or severe as previously thought: “We show that the common practice of using a default value without tuning can artificially inflate interference for both low and high resource pairs.”
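Temperature sampling is typically implemented by raising each language pair’s share of the data to the power 1/T and renormalizing. The sketch below illustrates the idea with hypothetical corpus sizes; the paper’s exact sampling scheme may differ in its details:

```python
def sampling_probs(pair_sizes: dict, temperature: float = 1.0) -> dict:
    """Compute per-language-pair sampling probabilities.

    Each pair's probability is proportional to its share of the data
    raised to the power 1/T. With T=1, pairs are sampled in proportion
    to their size; higher T flattens the distribution, upsampling
    low-resource pairs at the expense of high-resource ones.
    """
    total = sum(pair_sizes.values())
    weights = {p: (n / total) ** (1.0 / temperature)
               for p, n in pair_sizes.items()}
    z = sum(weights.values())
    return {p: w / z for p, w in weights.items()}

# Hypothetical corpus sizes (in sentence pairs), for illustration only:
sizes = {"en-fr": 1_000_000, "en-de": 100_000, "en-sw": 10_000}
print(sampling_probs(sizes, temperature=1.0))  # proportional to data size
print(sampling_probs(sizes, temperature=5.0))  # flatter distribution
```

Choosing the temperature therefore directly sets the proportion of “focus pair” examples seen at each training step, which is exactly the quantity the paper identifies as driving interference.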