Can Large Language Models Translate No-Resource Languages?


In a May 16, 2024 paper, Jared Coleman, Bhaskar Krishnamachari, and Khalil Iskarous from the University of Southern California, along with Ruben Rosales from California State University, introduced a new approach to machine translation (MT) that is “particularly useful” for no-resource languages, which lack publicly available bilingual or monolingual corpora.

This approach, named LLM-RBMT (LLM-Assisted Rule-Based Machine Translation), combines the strengths of large language models (LLMs) and rule-based machine translation (RBMT) techniques.

The researchers highlighted the exceptional capabilities of LLMs in MT but noted their limitations in low-resource or no-resource language scenarios. “There have been many efforts in improving MT for low-resource languages, but no-resource languages have received much less attention,” they said.

Despite the perception of RBMT as a “relic of the past”, the researchers emphasized ongoing research and development in RBMT systems tailored for under-resourced languages.

In this study, the researchers focused on Owens Valley Paiute (OVP), a critically endangered Indigenous American language for which there is virtually no publicly available data, and developed two LLM-assisted RBMT tools: one for translating OVP into English and another for translating English into OVP. As LLMs, they used OpenAI's gpt-3.5-turbo and gpt-4.

For the OVP to English translation, they created a selection-based OVP sentence builder where users can select different parts of speech like subjects, verbs, and objects to form valid OVP sentences. The system adjusts available options based on user selections to ensure grammatical correctness. Once a valid OVP sentence is created, the tool encodes it into structured English, and then transforms it into a natural language sentence with the help of an LLM.
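The sentence-builder idea can be sketched in code. The following is an illustrative toy, not the authors' implementation: the lexicon entries, glosses, and verb-object constraints are invented for demonstration. It shows the two mechanisms the article describes: narrowing the available options after a selection, and encoding a completed selection as structured English for an LLM to render naturally.

```python
# Illustrative sketch (not the authors' code) of a selection-based sentence
# builder: options are constrained by prior selections, and a completed
# selection is encoded as structured English for an LLM to naturalize.

# Hypothetical toy lexicon: each OVP-like entry maps to an English gloss.
LEXICON = {
    "subject": {"nüü": "I", "üü": "you"},
    "verb": {"tüka": "eat", "hibi": "drink"},
    "object": {"tüba": "pine nuts", "paya": "water"},
}

# Hypothetical constraint table: which objects each verb allows.
VALID_OBJECTS = {"tüka": {"tüba"}, "hibi": {"paya"}}

def object_options(verb):
    """Return only the object choices still valid after a verb is selected."""
    return {o: LEXICON["object"][o] for o in VALID_OBJECTS[verb]}

def encode_structured_english(subject, verb, obj=None):
    """Encode a completed selection as structured English for the LLM."""
    parts = [f"subject: {LEXICON['subject'][subject]}",
             f"verb: {LEXICON['verb'][verb]}"]
    if obj is not None:
        parts.append(f"object: {LEXICON['object'][obj]}")
    return "; ".join(parts)

print(object_options("hibi"))
print(encode_structured_english("nüü", "tüka", "tüba"))
```

The structured-English string (e.g. `subject: I; verb: eat; object: pine nuts`) is what the LLM would then turn into a fluent sentence such as “I eat pine nuts.”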

For the English to OVP translation, the tool allows users to input sentences in natural language (English in this case). The translation process involves using an LLM to simplify the input sentence into basic subject-verb and subject-verb-object structures, removing unnecessary elements like adjectives and adverbs. The simplified sentences are then combined with the available vocabulary to create valid OVP sentences using the sentence-building tool.
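The simplification step can be sketched as a prompt sent to a chat-completion API. This is an assumption-laden illustration: the prompt wording is invented, and only the request payload is built (no API call is made), so the exact instructions the authors used may differ.

```python
# Illustrative sketch (not the authors' prompt): build a chat-completion
# request asking an LLM to reduce an English sentence to bare
# subject-verb(-object) clauses before rule-based translation into OVP.

SIMPLIFY_PROMPT = (
    "Rewrite the sentence as one or more simple clauses, each containing "
    "only a subject, a verb, and optionally an object. Drop adjectives, "
    "adverbs, and other modifiers.\n\n"
    "Sentence: {sentence}\n"
    "Simplified:"
)

def build_simplify_request(sentence, model="gpt-3.5-turbo"):
    """Build a chat-completion request payload for the simplification step."""
    return {
        "model": model,
        "messages": [
            {"role": "user",
             "content": SIMPLIFY_PROMPT.format(sentence=sentence)}
        ],
    }

req = build_simplify_request("The tired hunter quickly ate the sweet pine nuts.")
print(req["messages"][0]["content"])
```

Given the example input, a simplification such as “The hunter ate pine nuts” keeps only the structures the OVP sentence builder can express.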

The researchers explained that the LLMs do not directly interact with the target language but provide guidance on how to utilize rule-based systems effectively to produce translations that closely match the original input. 

These are the first MT tools for OVP. However, the researchers noted that these tools were designed to assist language learners in expressing ideas using basic sentence structures, focusing on language teaching and revitalization rather than being general-purpose translators.

The researchers are actively working on expanding the translation tool by incorporating more vocabulary, introducing more complex sentence structures, and developing versions for other languages.

They believe that this research opens up many directions for future work, leveraging the promising capabilities of LLMs in revitalizing critically endangered languages. “The remarkable general-purpose language skills that LLMs exhibit make them a promising tool in helping revitalize critically endangered languages,” they said.

The researchers have made the code open-source on GitHub.