Currently, most machine translation (MT) systems are English-centric: translating between two non-English languages requires a process called pivoting, in which text is first translated into English and then from English into the target language. Pivoting can lead to so-called error cascades, such as losing important information about gender and formality, and it also increases latency.
Multilingual Neural Machine Translation (MNMT) aims to improve the quality of translations between non-English languages by reducing latency and avoiding the error cascades that occur when translating through English. However, training multilingual models is not an easy task: the more languages are added, the more they compete for the model’s parameters.
Increasing model size is not always a viable solution, as it may lead to difficulties in training, slower inference, and larger storage requirements, researchers from Apple explained in a research paper published on May 4, 2023.
To address this issue, the researchers proposed a new solution called Language-Specific Transformer Layers (LSLs). This method increases model capacity per language while allowing sharing of knowledge between languages without increasing the inference cost.
The proposed architecture includes shared and language-specific weights, where some layers of the encoder are source or target language-specific, while the remaining layers are shared. “The idea of LSLs is simple: instead of sharing the same parameters across all languages, have the weights for the layer be language-specific,” said the researchers.
This method “benefits from having both language-specific and shared components, as well as from having source and target language-specific components,” they added.
LSLs consist of one “regular” Transformer encoder layer per language. The input is routed to the appropriate sub-layer based on the source or target language, and only one sub-layer is used at any given time.
Simply replacing all layers in the Transformer with LSLs would increase the number of parameters and decrease sharing between languages, explained the researchers. To avoid this, they suggest using a combination of LSLs and regular Transformer layers, which enables the model to learn both shared and language-specific weights.
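The routing and mixing described above can be sketched in plain Python. This is an illustrative toy, not Apple's implementation: the class and language names are hypothetical, and simple list operations stand in for actual Transformer computations.

```python
# Toy sketch of an encoder mixing shared layers with Language-Specific
# Transformer Layers (LSLs). Hypothetical names; lists stand in for tensors.

class SharedLayer:
    """Stand-in for a regular Transformer encoder layer used by all languages."""
    def __call__(self, x, lang):
        return x + ["shared"]

class LanguageSpecificLayer:
    """One 'regular' encoder layer per language; routes by language code."""
    def __init__(self, languages):
        self.sublayers = {lang: f"lsl_{lang}" for lang in languages}

    def __call__(self, x, lang):
        # Only the routed language's sub-layer runs, so inference cost
        # matches a single shared layer despite the extra parameters.
        return x + [self.sublayers[lang]]

# A small encoder stack: shared layers in some positions, an LSL in another.
langs = ["de", "ko", "sw"]
encoder = [SharedLayer(), LanguageSpecificLayer(langs), SharedLayer()]

def encode(tokens, lang):
    x = list(tokens)
    for layer in encoder:
        x = layer(x, lang)
    return x

trace = encode(["x"], "ko")  # ['x', 'shared', 'lsl_ko', 'shared']
```

The key property the sketch captures is that the LSL holds parameters for every language, but each input only ever passes through one of its sub-layers, while the shared layers are exercised by all languages.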
Discovering the Best Architecture
To automatically determine which layers should be shared and which should be source- or target-indexed LSLs, the researchers proposed a neural architecture search (NAS) inspired approach. NAS utilizes optimization algorithms to discover and design the best architecture for a neural network for a specific need.
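As a rough intuition for what such a search does, the sketch below enumerates per-layer choices (shared, source-indexed LSL, or target-indexed LSL) and keeps the best-scoring architecture. Everything here is a hypothetical stand-in: a real search would score candidates on dev-set translation quality and use optimization algorithms rather than brute force, and the toy scoring function below is invented purely for illustration.

```python
# Toy sketch of a NAS-style search over per-layer sharing choices.
# The scoring function is a hypothetical proxy, not the paper's criterion.

import itertools

CHOICES = ["shared", "source_lsl", "target_lsl"]

def score(architecture):
    # Invented proxy: reward architectures with shared layers in the middle
    # and language-specific layers at the edges.
    n = len(architecture)
    return sum(1 for i, c in enumerate(architecture)
               if (c == "shared") == (n // 4 <= i < 3 * n // 4))

def search(num_layers):
    # Exhaustive enumeration is only feasible for tiny encoders; real NAS
    # replaces this loop with an optimization algorithm.
    return max(itertools.product(CHOICES, repeat=num_layers), key=score)

best = search(4)  # per-layer assignment maximizing the toy score
```

The output is an assignment of one choice per encoder layer, which is exactly the decision the researchers' NAS-inspired approach automates.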
In addition, the researchers found that initializing all encoder weights from a pre-trained architecture consisting only of “regular” Transformer layers helped to achieve better performance. They used pre-trained weights from their baseline architectures to initialize the language-specific modules.
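The initialization idea can be sketched as copying a pre-trained shared layer's weights into every language-specific sub-layer before multilingual training continues. In this minimal sketch, dictionaries of lists stand in for weight tensors and all names are hypothetical.

```python
# Sketch: initialize each language-specific sub-layer from pre-trained
# shared weights. Dicts stand in for weight tensors; names are hypothetical.

import copy

# Weights of a "regular" Transformer layer pre-trained on all languages.
pretrained_layer = {"attn": [0.1, 0.2], "ffn": [0.3, 0.4]}

def init_lsl_from_pretrained(pretrained, languages):
    # Every language starts from identical shared weights and then diverges
    # during subsequent multilingual training; deep copies keep them independent.
    return {lang: copy.deepcopy(pretrained) for lang in languages}

lsl_weights = init_lsl_from_pretrained(pretrained_layer, ["de", "ko", "sw"])
```

Starting all sub-layers from the same shared checkpoint, rather than from random weights, is what lets low-resource languages inherit cross-lingual knowledge instead of training their components from scratch.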
According to them, this approach maximizes cross-lingual transfer, mitigates under-trained language-specific components for low-resource languages, and improves convergence speed for architectures with LSLs.
In their experiments, they focused on ten languages: English, German, Spanish, French, Italian, Japanese, Korean, Portuguese, Swahili, and Chinese. The proposed approach resulted in substantial gains for both high-resource languages, such as English and German, and low-resource languages, such as Korean and Swahili.
The researchers highlighted that using multilingual instead of bilingual translation systems can help reduce gender bias that arises due to pivoting through English. They also said that their proposed architecture can lead to smaller and faster-to-train models compared to similarly-performing baselines, which can enhance the efficiency of translation systems.