One of the better-known large language models (LLMs) is BLOOM. To some NLP experts, it is also one of the most important, thanks to its architecture, its multilingual coverage (46 languages), and its open-source nature. As other researchers have also established, LLMs generally perform well at machine translation (MT) when certain conditions are met.
BLOOM's MT abilities, specifically, were the focus of a study conducted by French researchers Rachel Bawden (French Institute for Research in Computer Science and Automation, Inria) and François Yvon (Université Paris-Saclay), who presented their results in a paper published on March 3, 2023.
Using multiple datasets, as well as high- and low-resource language pairs, Bawden and Yvon demonstrate that although BLOOM's zero-shot performance (i.e., prompting with no in-context examples) had a number of issues, few-shot prompting produced “very good results for a number of language pairs.”
The researchers begin their paper by acknowledging multiple effective strategies to improve results for LLM-based MT. They cite literature related to prompting-based MT, fine-tuning, pre-training, and other strategies. They argue, however, that “LLM analyses primarily focus on their multitask rather than multilingual ability.”
Testing BLOOM for Multilingual Quality and Context
Bawden and Yvon wanted to evaluate BLOOM both with and without few-shot examples, i.e., via in-context learning rather than any updates to the model's weights. They also wanted to test the effect of prompt design across multiple language pairs. Other aspects evaluated include model size, transfer across languages, and the use of linguistic context.
The researchers used several datasets: WMT, Flores-101, and DiaBLa. To account for high- and low-resource representation in the study, they used the WMT 2014 news test sets for English <> French and English <> Hindi. The metrics used include BLEU and COMET. There is no mention in the research summary of a human evaluation.
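For readers unfamiliar with these metrics, here is a minimal sketch of how such automatic scoring is typically run, using the open-source sacrebleu and Unbabel COMET packages with made-up example sentences. The paper does not specify its exact tooling or COMET checkpoint, so both are assumptions here.

```python
# pip install sacrebleu unbabel-comet
import sacrebleu
from comet import download_model, load_from_checkpoint

# Hypothetical example data; in the study these would come from the
# WMT 2014 news test sets (English<>French, English<>Hindi).
sources = ["Le chat est assis sur le tapis."]
hypotheses = ["The cat sits on the mat."]
references = ["The cat is sitting on the mat."]

# BLEU measures surface n-gram overlap between hypothesis and reference.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# COMET is a learned metric: a downloaded scoring model judges the
# hypothesis against both the source and the reference.
model_path = download_model("Unbabel/wmt22-comet-da")  # assumed checkpoint
comet_model = load_from_checkpoint(model_path)
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
print(f"COMET: {comet_model.predict(data, batch_size=8, gpus=0).system_score:.3f}")
```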
BLOOM Can Produce Quality MT with Few-Shot Prompting
The bottom line is that BLOOM is capable of producing MT output of adequate quality, provided it is given a few translation examples in the prompt. It does not perform as well for language pairs whose languages are underrepresented in its training data.
The zero-shot experiments, in particular, produced numerous errors: the researchers identified spurious content (over-generation) and parts of sentences rendered in the wrong language. Few-shot prompting reduced both problems across all datasets and language pairs, with better results obtained for high-resource languages, as expected.
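To make the zero-shot/few-shot distinction concrete, here is a rough sketch of both prompting styles against a small BLOOM checkpoint via the Hugging Face transformers library. The prompt template and example sentences are illustrative assumptions, not the prompts used in the paper, and the 560M-parameter model is far smaller than the variants the study centered on.

```python
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A small BLOOM variant for illustration; the study evaluated much larger ones.
model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def translate(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=40)
    # Return only the newly generated continuation, not the echoed prompt.
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

# Zero-shot: no in-context examples, just a bare template (assumed format).
zero_shot = "French: Le chat dort.\nEnglish:"

# Few-shot: the same template preceded by a couple of translated examples,
# supplied entirely in the prompt -- the model's weights are unchanged.
few_shot = (
    "French: Bonjour, comment allez-vous ?\nEnglish: Hello, how are you?\n\n"
    "French: Il pleut aujourd'hui.\nEnglish: It is raining today.\n\n"
    "French: Le chat dort.\nEnglish:"
)

print(translate(zero_shot))
print(translate(few_shot))
```

In the few-shot case, the in-context examples anchor the expected output language and format, which is consistent with the reduction in over-generation and wrong-language output the researchers report.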
On the matter of context, evaluation scores were not higher in either setting. Results for high-resource languages were good overall; this was also the case for Romance languages, which share similarities among themselves. “These contrasted results show the performance of an LLM not only depends on the amount of training data, but also largely on the similarity with seen languages,” the researchers stated.