How Cohere Built Its Multilingual Generative AI Model Aya

The first SlatorCon Remote conference of 2024 was held on March 20, featuring a keynote address by Marzieh Fadaee, Senior Research Scientist at Cohere, who discussed the outcomes of the Aya project. (The word “Aya” originates from the Twi language, symbolizing endurance and resourcefulness.)

Aya represents a collaborative effort among researchers worldwide, resulting in the development of an open-source, massively multilingual large language model (LLM) covering 101 different languages. With Aya, Cohere also created one of the largest datasets for instruction fine-tuning of multilingual models.

Fadaee emphasized the importance of inclusivity in AI, stressing the need to bridge the gap between high-resource and low-resource languages for equitable access to advanced AI technologies. Motivated by the limitations of English-centric models — particularly in translation tasks — the Aya project aimed to extend the reach of AI technologies to languages beyond English, achieving massive multilinguality.

As Fadaee explained, this involves not only building the models themselves but also recognizing that their foundation is the data they are trained on.

Fadaee told the audience just how important data is in training multilingual models. She outlined two key stages: pretraining on extensive unlabeled corpora and fine-tuning with supervised instruction-style data. 

However, high-quality instruction fine-tuning data is hard to come by for low-resource languages, a gap the Aya project set out to close through meticulous curation and augmentation of linguistic datasets.
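
To make the distinction between the two stages concrete, the sketch below shows roughly what each kind of data looks like. The field names and examples are illustrative only and do not reflect the actual schema of the Aya datasets.

```python
# Illustrative sketch of the two data regimes Fadaee described.
# Field names and examples are made up for illustration, not the Aya schema.

# Pretraining: large volumes of raw, unlabeled text in many languages.
pretraining_corpus = [
    "Ferns grow well in shaded, humid environments.",
    "Les fougères poussent bien dans les endroits ombragés et humides.",
]

# Instruction fine-tuning: supervised prompt-completion pairs that teach the
# model to follow instructions, ideally written natively in each language.
instruction_examples = [
    {
        "language": "English",
        "prompt": "List two benefits of drinking enough water every day.",
        "completion": "It helps the body regulate temperature and supports digestion.",
    },
    {
        "language": "French",
        "prompt": "Citez deux avantages de boire suffisamment d'eau chaque jour.",
        "completion": "Boire suffisamment d'eau aide le corps à réguler sa température "
                      "et facilite la digestion.",
    },
]
```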

Valuable Contribution

The dataset they have created comprises three components: the Aya dataset, the Aya collection, and the Aya evaluation suite.

The Aya dataset is the largest human-curated multilingual instruction fine-tuning dataset to date, with over 200,000 high-quality annotations in 65 languages. While 200,000 examples may seem modest compared to other automatically generated datasets, it is, in fact, the largest dataset in this format and in terms of language coverage, according to Fadaee. This makes it particularly “valuable”, especially for languages with limited representation, she said.

In addition to the Aya dataset, they curated existing datasets, converted them into instruction format, and then machine-translated them from English into 101 languages to increase coverage. This expanded collection, comprising 513 million instances across 114 languages, is the Aya collection.
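
As a rough illustration of that kind of pipeline, the sketch below wraps an existing labeled example in an instruction template and marks where a machine-translation step would expand coverage beyond English. The template wording and the machine_translate helper are hypothetical stand-ins, not Cohere’s actual tooling.

```python
# Hypothetical sketch: turning an existing labeled example into instruction
# format, then translating it to expand language coverage.

def to_instruction_format(text: str, label: str) -> dict:
    """Wrap a sentiment-classification example in a prompt-completion template."""
    return {
        "prompt": (
            "Classify the sentiment of the following review as positive or negative.\n"
            f"Review: {text}"
        ),
        "completion": label,
    }

def machine_translate(text: str, target_lang: str) -> str:
    """Placeholder for the machine-translation step; a real pipeline would
    call a translation model here."""
    return text  # identity stand-in so the sketch runs end to end

example = to_instruction_format("The film was a delight from start to finish.", "positive")

# Expand coverage by translating the English instruction data into other languages.
translated_example = {
    "prompt": machine_translate(example["prompt"], target_lang="sw"),
    "completion": machine_translate(example["completion"], target_lang="sw"),
}
```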

Fadaee highlighted the involvement of contributors from across the globe, including a substantial number from low-resource language communities, ensuring a balanced representation of diverse linguistic and cultural nuances in the Aya collection.

Finally, the Aya evaluation suite provides a robust framework for assessing the efficacy and reliability of multilingual models across various tasks and languages.

Huge Win

The Aya model was built by fine-tuning the 13B parameter mT5 model using a subset of the Aya dataset along with some other existing datasets. In a comparison against the strongest baseline, Fadaee underscored the model’s superior performance in open-ended generation tasks. “We saw a huge win for the Aya model,” she said.
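
For readers curious what instruction fine-tuning of an mT5-style model involves in practice, the sketch below shows a single supervised training step using the Hugging Face transformers library. It is a generic seq2seq fine-tuning recipe, not Cohere’s actual training setup, and a small mT5 checkpoint stands in for the 13B model used for Aya.

```python
# Generic sketch of one instruction fine-tuning step for an mT5-style model.
# "google/mt5-small" stands in for the 13B mT5 variant used for Aya.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One instruction-style prompt-completion pair (illustrative example).
prompt = "Citez deux avantages de boire suffisamment d'eau chaque jour."
completion = "Boire de l'eau aide à réguler la température du corps et facilite la digestion."

# The encoder receives the instruction; the decoder learns to produce the completion.
inputs = tokenizer(prompt, return_tensors="pt")
labels = tokenizer(completion, return_tensors="pt").input_ids

outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # backpropagate the supervised loss
optimizer.step()
```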

Despite the remarkable progress achieved by the Aya project, Fadaee acknowledged several areas for improvement and ongoing challenges. “There are a lot of different ways that we are still not there yet,” she said. These include the need for better multilingual pre-trained models, expanded language coverage, and addressing evaluation blind spots. “A hundred languages is really nothing,” she emphasized.

Additionally, Fadaee outlined future research directions aimed at balancing data quality and quantity, reducing language inequalities, and enhancing the overall state of multilingual language models. Looking ahead, Fadaee expressed optimism regarding the potential applications of Aya and its role in fostering linguistic diversity and inclusion in AI technologies. 

With the Aya dataset and the Aya model released under an open-source license, Fadaee envisions a collaborative effort to further advance multilingual AI research and empower communities worldwide to leverage AI-driven communication tools in their native languages. In closing, Fadaee invited the audience to sign up to use Aya in 22 languages in the “Aya Cohere Playground.”

If you missed SlatorCon Remote March 2024 in real time, recordings will be available in due course via our Pro and Enterprise plans.