Facebook and Twitter have released a paper in conjunction with Amazon and the University of Melbourne, Australia examining attacks that target machine translation (MT) systems.
The July 2021 paper, Putting words into the system’s mouth: A targeted attack on neural machine translation using monolingual data poisoning, stated that attackers can manipulate MT systems quite easily to produce specific, malicious output, such as misinformation or slander.
“We stress that this is a blind-spot in modern NMT, demanding immediate attention,” wrote authors Francisco Guzmán (Facebook AI), Ahmed El-Kishky (Twitter Cortex), Yuqing Tang (Amazon Alexa AI), Jun Wang, Chang Xu, Benjamin I. P. Rubinstein, and Trevor Cohn (University of Melbourne, Australia).
“These targeted attacks can be damaging to specific targets but also to the translation providers, who may face reputational damage or legal consequences,” the authors said.
The study analyzed two methods of poisoning monolingual training data for systems trained on back translation, where a target-to-source MT model uses monolingual text from the target language to produce source translations. The resulting parallel training data is then used to train source-to-target MT systems.
According to the authors, attackers can prompt toxic behavior in the final model through seemingly innocuous errors, such as dropping a word during back translation, or adding certain sentences to the monolingual training set.
Only 0.02% of the training set (e.g., 1,000 sentences out of 5,000,000) needs to be poisoned in order for an attack to succeed. Moreover, a large corpus can make it difficult to detect suspicious examples, especially when the target is unknown. Low-resource languages are likely even more vulnerable to these attacks since developers are more open to using content from dubious sources.
Attackers can manipulate machine translation systems quite easily to produce specific, malicious output, such as misinformation or slander
Straightforward “injection attacks” allow attackers to infiltrate a black-box MT system without accessing the system’s architecture, parameters, gradients, or optimization algorithm. For large corpora, however, the attack requires correspondingly large amounts of poisoned data, making injection attacks unfeasible in low-budget settings.
As an alternative, attackers may turn to “smuggling attacks,” inserting toxins into monolingual data with greater attack efficacy. The back translation is likely to contain the toxin’s back translation as well, but due to the phenomenon of “undertranslation” (i.e., when some parts of the sentence are omitted in translation), the toxin may appear only on the target side.
This works to the attacker’s advantage. The back-translated sentence looks clean while still generating toxins. Smuggling attacks are also transferable, meaning the attacker does not need access to the target back-translation model to launch a successful attack.
In a low-resource attack, and with a high attack budget, the injection attack had a success rate of over 50% — although this diminishes as the attack budget decreases. The researchers also found that, for high-resource systems, the smuggling attack was highly effective, even with a low attack budget, where the injection attack barely registered.
“These targeted attacks can be damaging to specific targets but also to the translation providers, who may face reputational damage or legal consequences”
The authors acknowledged that “how to mount a more effective defense is a critical open question.” For partial defense against attacks, they wrote, developers should limit MT models’ use of unreliable monolingual data by upsampling clean parallel data during training. However, this method can result in lower MT quality and BLEU scores.
For now, the most promising defense is likely fine-tuning the model on curated clean data.