How to Train Multilingual Modular Machine Translation Systems With MAMMOTH

In a March 12, 2024 paper, researchers from the University of Helsinki, Silo AI, and NVIDIA introduced MAMMOTH, a toolkit designed to simplify the training of massively multilingual modular machine translation (MT) systems at scale.

As the researchers explained, in the era of large language models (LLMs), the trend leans towards using massive “monolithic” models that require huge amounts of data to train effectively due to their vast number of parameters. However, this trend is not “sustainable” due to data scarcity and significant financial and ecological costs.

In multilingual natural language processing (NLP), the challenges of scalability are evident. Scaling up a multilingual model to cover many languages often degrades performance, because the model's limited capacity must be spread across diverse languages. On the other hand, simply increasing the overall model size is hindered by limitations in hardware, available data, and the efficiency of training algorithms.

To tackle these scalability challenges, the researchers proposed modularity as a solution. Modularity involves breaking down and organizing neural networks into smaller, separate, and interchangeable modules (i.e., parts) that can perform specific tasks. These modules are only activated when needed, making the system more efficient by focusing on what is necessary. This approach is a necessary step towards designing smaller sub-networks and components with specialized functionality, according to the researchers.
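To make the idea concrete, the following minimal PyTorch sketch stores one small, interchangeable module per task and activates only the module a given input needs. The task names, layer type, and dimensions here are illustrative assumptions, not MAMMOTH's actual architecture.

# A minimal sketch of the modular idea (illustrative assumptions throughout;
# this is not MAMMOTH's implementation).
import torch
import torch.nn as nn

class ModularModel(nn.Module):
    def __init__(self, tasks, dim=512):
        super().__init__()
        # One small, interchangeable module per task, stored by name so that
        # individual modules can be added, swapped, or removed independently.
        self.modules_by_task = nn.ModuleDict(
            {task: nn.Linear(dim, dim) for task in tasks}
        )

    def forward(self, x, task):
        # Only the module for the requested task is activated; all other
        # parameters remain untouched during this forward pass.
        return self.modules_by_task[task](x)

model = ModularModel(tasks=["summarize", "translate", "classify"])
out = model(torch.randn(8, 512), task="translate")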

What Is Modularity?

The concept of modularity rests on two main ideas: sparsity enforcement and conditional computation. Sparsity allows a network to be large during training but streamlined at inference, since only the necessary modules are activated. Conditional computation routes information through the network according to the task at hand, so that only the relevant model parameters are used for a given input.
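As a concrete illustration, here is a minimal PyTorch sketch of conditional computation with a hard router. The gating scheme, expert count, and per-batch routing granularity are illustrative assumptions, not the routing that MAMMOTH itself implements.

# A minimal sketch of conditional computation (assumptions: hard top-1
# routing, routing decided once per batch for simplicity).
import torch
import torch.nn as nn

class ConditionallyComputedLayer(nn.Module):
    def __init__(self, num_experts=4, dim=512):
        super().__init__()
        # The full set of experts makes the network large at training time...
        self.experts = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x):
        # ...but at inference only one expert is computed (sparsity): the
        # router scores all experts and we run just the top-scoring one.
        scores = self.router(x.mean(dim=0))
        chosen = int(scores.argmax())
        return self.experts[chosen](x)

layer = ConditionallyComputedLayer()
y = layer(torch.randn(8, 512))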

In simpler terms, imagine translating from one language to another. To do this effectively, you would use an encoder trained on the source language and a decoder trained on the target language. The encoder has only seen data in the source language, yet it can be paired with decoders for various target languages.

The same goes for the decoder: it can be combined with encoders for different source languages. The dynamic selection of modules ensures that only the parts needed for a particular language pair are active during translation, making the process faster and more accurate. If a module (in this case, a language-specific one) is not relevant to the translation task at hand, it is effectively turned off to avoid unnecessary computation.
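A minimal PyTorch sketch of this per-language-pair selection might look as follows; the language codes and the single-layer encoder/decoder stand-ins are illustrative assumptions, not the components MAMMOTH ships with.

# A minimal sketch of per-language module selection for translation
# (assumptions: toy language set, one Transformer layer per module).
import torch
import torch.nn as nn

LANGS = ["en", "fi", "de"]
DIM = 512

# One encoder per source language, one decoder per target language.
encoders = nn.ModuleDict(
    {lang: nn.TransformerEncoderLayer(DIM, nhead=8) for lang in LANGS}
)
decoders = nn.ModuleDict(
    {lang: nn.TransformerDecoderLayer(DIM, nhead=8) for lang in LANGS}
)

def translate_step(src_embeddings, tgt_embeddings, src_lang, tgt_lang):
    # Only the modules for this language pair are activated; every other
    # encoder and decoder is effectively "turned off" for this call.
    memory = encoders[src_lang](src_embeddings)
    return decoders[tgt_lang](tgt_embeddings, memory)

# Shapes are (sequence length, batch size, model dimension).
out = translate_step(torch.randn(10, 1, DIM), torch.randn(7, 1, DIM), "en", "fi")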

Overall, modularity leads to more efficient inference by reducing unnecessary computations and focusing on relevant parameters. It also improves the interpretability of the network, making it easier to understand the contribution of each parameter to specific tasks.

Additionally, modular architectures facilitate the design of reusable neural network components that can be combined to adapt to new tasks without the need for extensive retraining, promoting flexibility and versatility.

Scalability and Multilinguality

Despite the benefits of modularity, a significant challenge remains: the lack of a widely accepted and easily accessible framework for designing and managing such models. As the researchers highlighted, although several open-source frameworks for training neural machine translation (NMT) systems are available, none of them explicitly targets modularity.

The MAMMOTH toolkit aims to bridge this gap by providing a comprehensive framework for training modular encoder-decoder systems. Built upon OpenNMT-py, a customizable library for training NMT models, “MAMMOTH is the first open-source toolkit to jointly address the issues of scalability, multilinguality and modularity,” highlighted the researchers.

The researchers demonstrated the effectiveness of the MAMMOTH toolkit when running on clusters of NVIDIA GPUs, specifically the A100 and V100 models, which are known for their high computational power and are commonly used in deep learning tasks. The researchers acknowledged the contribution and support extended by the NVIDIA AI Technology Center Finland throughout the project.

The MAMMOTH toolkit is publicly available on GitHub, and the researchers encourage developers and fellow researchers to contribute to its development.

Authors: Timothee Mickus, Stig-Arne Grönroos, Joseph Attieh, Michele Boggia, Ona de Gibert, Shaoxiong Ji, Niki Andrea Loppi, Alessandro Raganato, Raúl Vázquez, Jörg Tiedemann