Enter CroissantLLM, an Open-Source LLM for Efficient Translation and Easy Deployment

CroissantLLM by Unbabel

In a February 2, 2024 paper, a group of researchers from CentraleSupélec, Carnegie Mellon University, and Unbabel introduced CroissantLLM, an open-source French-English large language model (LLM) that demonstrates strong performance on translation tasks and runs swiftly on consumer-grade local hardware.

LLMs have taken over natural language processing (NLP), with proprietary models leading but open-source models like Llama and Mistral catching up. However, the widespread adoption of these models faces obstacles such as opaque data collection and training processes, limited resources in languages other than English, and the high cost and scale of top-performing models, hindering both industrial and research uptake.

While many models exhibit some level of multilingual capability, the researchers noted a lack of significant efforts to train a model where English isn’t the dominant training language. “Our end goal is to have a model less skewed towards English performance or cultural biases,” they said.

CroissantLLM was trained on a diverse French corpus comprising 303 billion tokens drawn from sources such as internet data, literary works, speech transcripts, legal and administrative documents, scientific articles, and business documents. The corpus is distributed under permissive licenses that allow unrestricted commercial use, and it has been heavily filtered, curated, and deduplicated.

According to the researchers, this is “the largest multi-source French language corpus released to date of sufficient quality for language modeling purposes.” 

The researchers highlighted that “this work enriches the NLP landscape, breaking away from previous English-centric work to strengthen our understanding of multilingualism in language models.”

Strong Performance on Translation Tasks

The model’s translation capabilities were evaluated using the COMET-22 and BLEU metrics across three benchmarks: WMT14, TICO, and FLORES.

The results showed that the model excels in its size category, demonstrating “very strong performance on translation tasks.” Specifically, CroissantLLM surpassed Mistral 7B and Llama 13B in few-shot settings and even matched the specialized translation model NLLB 1.3B, despite the latter being trained on significantly more parallel data.
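For readers who want to run a similar evaluation, the sketch below scores a handful of illustrative French-English translations with BLEU (via sacrebleu) and COMET-22 (via Unbabel’s comet package). The sentences, batch settings, and checkpoint name are assumptions for illustration, not the paper’s exact setup.

```python
# pip install sacrebleu unbabel-comet
import sacrebleu
from comet import download_model, load_from_checkpoint

# Illustrative French sources, system translations, and reference translations.
sources = ["Le chat dort sur le canapé.", "Il pleut beaucoup aujourd'hui."]
hypotheses = ["The cat is sleeping on the couch.", "It is raining a lot today."]
references = ["The cat sleeps on the sofa.", "It is raining heavily today."]

# Corpus-level BLEU (sacrebleu expects a list of reference streams).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# COMET-22: download a reference-based checkpoint and score each triple.
comet_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(comet_path)
data = [
    {"src": s, "mt": h, "ref": r}
    for s, h, r in zip(sources, hypotheses, references)
]
comet_out = comet_model.predict(data, batch_size=8, gpus=0)
print(f"COMET-22 system score: {comet_out.system_score:.3f}")
```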

Easily Accessible

At 1.3 billion parameters, CroissantLLM is far lighter than proprietary models and even the smaller members of the Llama and Mistral model families. This is intended to facilitate widespread adoption: many high-performing LLMs require expensive, specialized infrastructure for inference, which raises costs and complicates deployment.

The model runs efficiently on local hardware such as personal computers and even low-end smartphones, and it can be deployed on inexpensive CPU servers or low-end GPU servers, making it accessible to a wide range of applications and users in real-world scenarios.
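To give a concrete sense of what CPU-only deployment can look like, here is a minimal sketch that loads the model with Hugging Face transformers and generates a translation locally. The repository identifier croissantllm/CroissantLLMBase and the prompt format are assumptions based on the public release and may differ from the exact published checkpoints.

```python
# pip install transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face Hub identifier for the released base checkpoint.
model_id = "croissantllm/CroissantLLMBase"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.eval()  # a 1.3B-parameter model fits in a few GB of RAM on CPU

# Simple prompt-completion style translation request (illustrative prompt).
prompt = "Traduis en anglais : « Le petit chat dort sur le canapé. »\nTranslation:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```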

In a commitment to transparency and to encourage further research in LLMs, the researchers provided access to codebases and numerous checkpoints, training data distributions, and training steps, as well as fine-tuned chat models and robust translation models.

Authors: Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, Pierre Colombo, António Loison, Duarte M. Alves, Caio Corro, Nicolas Boizard, João Alves, Ricardo Rei, Pedro H. Martins, Antoni Bigata Casademunt, François Yvon, André F.T. Martins, Gautier Viaud, and Céline Hudelot.