On July 7, 2023, a team of NLP researchers from China introduced BigTranslate, a large language model (LLM) that extends the capabilities of existing models to support multilingual translation across more than 100 languages. The model has been made available on GitHub.
The group’s members are affiliated with the Institute of Automation, Chinese Academy of Sciences; the School of Artificial Intelligence at the University of Chinese Academy of Sciences; and Wuhan AI Research.
Building on the foundation of LLaMA, an LLM introduced by Meta AI in February 2023, the researchers say BigTranslate is a unified solution designed to handle the translation of low-resource languages with high accuracy.
According to the researchers, BigTranslate’s strength lies in its focused training on Chinese and a large-scale parallel dataset of 102 languages. This methodology not only boosts its proficiency in Chinese (a language for which LLaMA previously demonstrated less than satisfactory results in understanding and generation) but also balances its competence across high-resource and low-resource languages.
To bolster BigTranslate’s multilingual competency, the researchers built a large parallel corpus covering 102 languages. The corpus was drawn from various public and proprietary sources, ensuring a broad linguistic foundation.
Recognizing the potential imbalance between language pairs, the team adopted a data augmentation strategy. This allowed for greater inclusion of underrepresented language directions, yielding a more balanced corpus for accurate translations.
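The article does not detail the exact balancing technique, but a common approach in multilingual machine translation is temperature-based sampling, which upsamples rare language pairs during training. The sketch below is a hypothetical illustration of that idea, not BigTranslate's documented method; the function name and parameters are assumptions.

```python
# Hypothetical sketch of temperature-based sampling for balancing
# language pairs in a multilingual corpus. This is a common strategy
# in multilingual MT, not necessarily the one BigTranslate used.

def sampling_weights(pair_counts, temperature=0.3):
    """Flatten the language-pair distribution.

    Each pair's share of the corpus is raised to the power
    `temperature` and renormalized. temperature=1 keeps the original
    proportions; values below 1 boost underrepresented pairs;
    values near 0 approach uniform sampling.
    """
    total = sum(pair_counts.values())
    scaled = {p: (c / total) ** temperature for p, c in pair_counts.items()}
    z = sum(scaled.values())
    return {p: w / z for p, w in scaled.items()}
```

With a heavily skewed corpus (say, a million en-zh sentence pairs versus a thousand bo-zh pairs), a temperature below 1 gives the low-resource direction a far larger sampling probability than its raw share of the data.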
To measure BigTranslate’s effectiveness, the team conducted comprehensive multilingual translation experiments across all 102 supported languages, testing the model against widely used translation systems such as Google Translate and ChatGPT and analyzing translations from a multitude of languages into English or Chinese.
The evaluation revealed that BigTranslate surpasses ChatGPT’s BLEU scores, a standard measure of translation quality, in nine language pairs.
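BLEU scores a system translation by how many of its n-grams (word sequences up to length four) also appear in a reference translation, with a penalty for overly short output. As an illustration of the metric itself (not BigTranslate's evaluation pipeline, which would use a standard toolkit), a minimal single-reference, unsmoothed sentence-level BLEU can be sketched as:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: single reference, uniform
    weights, no smoothing, so any empty n-gram overlap (common for
    very short sentences) drives the score to 0."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Clipped precision: each candidate n-gram counts only as
        # often as it appears in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An exact match scores 1.0, and a near match (one word off) lands somewhere between 0 and 1; production evaluations typically use corpus-level BLEU with standardized tokenization rather than this per-sentence form.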
Matching Google Translate
Taking the assessment a step further, the researchers used an automatic evaluation with GPT-4, focusing on the semantic similarity and style consistency between the source and the translation. In this evaluation, BigTranslate closely matched Google Translate in many language pairs, the researchers claim.
Given its proficiency in translating languages such as Tibetan and Mongolian, BigTranslate may find adoption in the domestic Chinese market.
BigTranslate is just one example of the thousands of models being open-sourced on platforms such as GitHub or Hugging Face. Testifying before the US Congress in late June 2023, Hugging Face CEO Clement Delangue told lawmakers that on his platform alone, researchers and developers have shared over 200,000 models so far, with about 5,000 new ones being added every week.
Delangue specifically highlighted the translation of low-resource languages as a key use case of models shared on his platform. However, BigTranslate will likely not be able to rest on its Tibetan and Mongolian laurels for long.
Just as this article went to press, China-based Alibaba Group released POLYLM, which seeks to “transfer general knowledge to low-resource languages while maintaining the advantage of high-resource language in the model.” Many more models will follow.