A research paper published on October 3, 2023, has brought to light a vulnerability in the safety mechanisms of large language models (LLMs).
Zheng-Xin Yong, Cristina Menghini, and Stephen H. Bach from Brown University demonstrated that “simply translating unsafe inputs to low-resource natural languages using Google Translate is sufficient to bypass safeguards and elicit harmful responses” from LLMs.
The researchers evaluated 12 languages spanning low-resource (LRL), mid-resource (MRL), and high-resource (HRL) categories, using the most recent version of GPT-4, which the authors note is reported to be “safer”.
The results revealed that translating potentially harmful English inputs into LRLs and presenting them to the model increased the likelihood of GPT-4 generating harmful content from under 1% to 79%. In contrast, the safeguards held up better for MRLs and HRLs, with individual attack success rates falling below 15%.
The authors found it “particularly alarming” that they achieved this high attack success rate without the use of jailbreak prompts, which they define as “adversarial prompts deliberately crafted and added to inputs to bypass moderation features.”
Furthermore, they underscored that when GPT-4’s responses were translated back into English, the output was “coherent”, “on-topic”, and “harmful.” Contrary to prior studies that highlighted LLMs’ struggles with low-resource languages, these findings show that “GPT-4 is sufficiently capable of generating harmful content in a low-resource language,” they observed.
A Valid Concern
The authors believe that cross-lingual safety is a “valid concern.” The disparity in LLMs’ ability to fend off attacks delivered in HRLs versus LRLs underscores the “unequal valuation” and “unfair treatment” of languages in AI safety research.
They explained that the existing safety alignment of LLMs focuses predominantly on English, while toxicity and bias detection benchmarks mostly cover HRLs. Previously, this linguistic inequality mainly caused utility and accessibility issues for LRL users. “Now, the inequality leads to safety risks that affect all LLMs users,” they noted.
The authors further stressed the importance of a “more holistic” and “inclusive” red-teaming approach. They argued that red-teaming LLMs in monolingual, high-resource settings can create the “illusion of safety,” especially when LLMs are already powering many multilingual services and applications.
For LLMs to be “truly safe”, the authors argue, safety mechanisms must apply across diverse languages, and red-teaming efforts must become more “robust” and “multilingual.”
“We emphasize the need for research on the intersection of safety and low-resource languages […] in order to address cross-lingual vulnerabilities that render existing safeguards ineffective,” they concluded.