Major tech companies have been playing with large language models (LLMs) for a while, with OpenAI’s GPT-3 making its way onto the global stage in 2020. Google also unveiled its Pathways Language Model (PaLM) in April 2022.
Meta’s May 2022 decision to open-source OPT-175B, meanwhile, was considered a step forward in leveling the playing field for researchers without major capital. However, Hugging Face Researcher and Chief Ethics Scientist Margaret Mitchell warned that the model could be used, both on its own and through downstream applications, to generate harmful content.
Now, a new LLM is generating buzz as potentially bypassing some of those concerns.
BLOOM — an acronym for BigScience Large Open-science Open-access Multilingual Language Model — is the brainchild of BigScience, a collective of more than 1,000 volunteer researchers worldwide.
While French machine learning platform Hugging Face has led the project since 2021, contributors included Nvidia and Microsoft, with support from the French National Centre for Scientific Research (CNRS). BLOOM was built and trained using the Jean Zay supercomputer.
What Makes BLOOM Unique?
In a June 28, 2022 tweet, Cambrian AI analyst Alberto Romero declared BLOOM “the most important AI model in the last decade.”
“BLOOM isn’t architecturally different from GPT-3,” Romero explained. “What makes it unique is that it represents the starting point of a socio-political paradigm shift that will define the future of the AI field.”
Romero pointed out that current SOTA language models follow a certain trend; that is, “large transformer-based and trained with lots of data, using big computers.” Most significantly, however, “they all stem from the immense resources of private tech companies.”
Designed to be multilingual from conception — as opposed to many AI language models that rely on either Chinese or English — BLOOM now supports 46 human languages and 13 programming languages.
“BLOOM by @BigScienceW is the most important AI model in the last decade. Not DALL·E 2. Not PaLM. Not AlphaZero. Not even GPT-3. I’ll explain why in this short thread.” (Alberto Romero, @Alber_RomGar, June 28, 2022)
From Indic to African Languages
The Hugging Face model card breaks down the distribution of languages used in BLOOM’s training data, with English (30.04%), Simplified Chinese (16.2%), and French (12.9%) accounting for the greatest swaths.
Spanish and “Code,” each at 10.8%, were tied for fourth place. Training data also came from languages belonging to the Indic family (4.4%) and the Niger-Congo family (0.03%).
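As a quick sanity check on those figures, the quoted shares can be added up to see how much of the training data they account for. The Python sketch below uses only the percentages cited in this article; the name `quoted_shares` is illustrative, and the remainder is simply whatever the article does not itemize.

```python
# Shares of BLOOM's training data as quoted in this article (percent).
quoted_shares = {
    "English": 30.04,
    "Simplified Chinese": 16.2,
    "French": 12.9,
    "Spanish": 10.8,
    "Code": 10.8,
    "Indic family": 4.4,
    "Niger-Congo family": 0.03,
}

# Round to two decimals to avoid floating-point noise in the printed totals.
quoted_total = round(sum(quoted_shares.values()), 2)
not_itemized = round(100 - quoted_total, 2)

print(f"Quoted shares add up to {quoted_total}%")
print(f"Not itemized here: {not_itemized}%")
```

By this tally, the categories named above cover roughly 85% of the training data, leaving about 15% for languages not listed in this article.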
As an example of the project’s reach, Masakhane researcher and Hugging Face intern Chris Emezue told MIT Technology Review that Hugging Face coordinated with African AI researchers to find data sets, “such as records from local authorities or universities,” to train the model on African languages historically underrepresented online.
Training started on March 11, 2022, and wrapped up on July 5, 2022, at an estimated cost of USD 2–5m in cloud computing, according to the same model card. With regard to environmental impact, the card notes that “the training supercomputer, Jean Zay, uses mostly nuclear energy. The heat generated by it is reused for heating campus housing.”
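For a rough sense of scale, the dates and cost range from the model card can be turned into a training duration and an implied average daily cost. This is back-of-the-envelope arithmetic on the figures quoted here, not a number reported by BigScience. The variable names are illustrative, and the even spread of spending is an assumption.

```python
from datetime import date

# Training window as stated on the model card.
start = date(2022, 3, 11)
end = date(2022, 7, 5)
training_days = (end - start).days

# Estimated cloud-computing cost range quoted in the article, in USD.
cost_low, cost_high = 2_000_000, 5_000_000

# Implied average cost per day, assuming spend was spread evenly (a simplification).
daily_low = cost_low / training_days
daily_high = cost_high / training_days

print(f"Training ran for {training_days} days")
print(f"Implied average cost: USD {daily_low:,.0f} to {daily_high:,.0f} per day")
```

On these assumptions, training lasted 116 days, which works out to somewhere between roughly USD 17,000 and USD 43,000 per day.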
As reported by VentureBeat, researchers trained BLOOM using existing open-source ML models and building on the open-source PyTorch ML framework, which enabled the model to work with dozens of different languages.
“This model is being created in order to enable public research on [LLMs, which] are intended to be used for language generation or as a pretrained base model that can be further fine-tuned for specific tasks,” according to the Hugging Face model card.
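To illustrate the “language generation” use case, here is a minimal sketch using the Hugging Face `transformers` library. It assumes `transformers` and PyTorch are installed, and it assumes a smaller checkpoint published on the Hugging Face Hub as `bigscience/bloom-560m` rather than the full model, which needs far more hardware. The function name `generate_text` and its defaults are hypothetical choices for this example, not part of any official BLOOM tooling.

```python
def generate_text(prompt: str,
                  model_name: str = "bigscience/bloom-560m",  # assumed smaller checkpoint
                  max_new_tokens: int = 30) -> str:
    """Continue `prompt` with a BLOOM checkpoint and return the full text."""
    # Imported lazily so that defining this function needs no third-party package.
    from transformers import pipeline

    # Downloads the checkpoint on first use, then builds a text-generation pipeline.
    generator = pipeline("text-generation", model=model_name)

    # The pipeline returns a list of dicts, one per generated sequence.
    outputs = generator(prompt, max_new_tokens=max_new_tokens)
    return outputs[0]["generated_text"]
```

Calling `generate_text("La capitale de la France est")` would download the checkpoint and return the prompt followed by the model’s continuation; the exact output depends on the checkpoint and generation settings.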
BLOOM uses its own open license, modeled on the Responsible AI License (RAIL), with the goal of keeping the model as open as possible while limiting the possibilities for misuse.
TechCrunch has reported that BigScience plans to charge researchers less than USD 40 per hour to access BLOOM on a cloud provider. To further equalize access, the organization is also developing “smaller, less hardware-intensive versions” of the model, a system to allow labs to share the model across servers, and an API.