Meta is back with a new large language model (LLM) — Llama 2, which was released via a research paper published in July 2023. Llama 2 is the successor to Llama 1, which Meta introduced in February 2023.
In a move that Yann LeCun, VP and Chief AI Scientist at Meta, described as “huge” in a July 18 tweet, the tech giant has chosen to make Llama 2 both open source and free for research and commercial use. Developers can therefore start building on Llama 2 right away.
“This is going to change the landscape of the LLM market,” LeCun stated.
Llama 2, like Llama 1 before it, takes its name from “Large Language Model Meta AI.” According to Meta, Llama 2 was trained on 40% more data than Llama 1: its pretrained models saw no fewer than two trillion tokens, and its fine-tuned models were trained on more than one million human annotations.
One might therefore presume, with all that training data, that Llama 2 could well have an edge (or, at least a use) in machine translation and other multilingual applications. Apparently, not so.
As Meta explained in the research paper, “Most data is in English, meaning that Llama 2 will perform best for English-language use cases.” It also warned, “A training corpus with a majority in English means that the model may not be suitable for use in other languages.”
According to the paper, the model’s pretraining data is nearly 90% English. Other languages, such as German, French, Chinese, Spanish, Dutch, Italian, Japanese, Polish, and Portuguese, collectively make up less than 2% of Llama 2’s training data, while the language is “unknown” for more than 8% of the training data. (This includes programming code.)
Llama 2’s lack of language diversity is somewhat surprising, given that Meta has focused heavily on improving coverage for low-resource languages (and poured significant R&D effort into the area) in recent years.
Or perhaps, after its self-proclaimed “breakthrough” in machine translation for low-resource languages in July 2022, Meta’s attention is beginning to shift to new and shinier areas of language research.