In a research paper published on May 19, 2023, Andrea Schioppa, Xavier Garcia, and Orhan Firat from Google demonstrate the benefits of incorporating cross-lingual supervision during the pre-training of large language models (LLMs).
As the Google researchers explained, LLMs are typically pre-trained using self-supervision, where models learn from unlabeled data without manual annotations. However, it has been observed that incorporating cross-lingual supervision — which involves aligned parallel data between source and target languages — during the pre-training of LLMs can improve their in-context learning abilities.
The researchers demonstrate that combining self-supervised language modeling and supervised machine translation (MT) objectives — by including cross-lingual parallel data during pre-training — leads to improved performance of LLMs in MT tasks.
MT systems, by contrast, rely on cross-lingual supervision in the form of aligned parallel data between source and target languages.
“The MT objective consists in predicting the target sentence given the source sentence, and therefore it is necessary to collect aligned pairs of texts between source and target languages,” the researchers said.
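As a concrete illustration of the two objectives, both can be framed as next-token prediction over a text sequence; what differs is how the training text is constructed. The function names and the "[en]"/"[fr]" language tags below are hypothetical, not the paper's actual data format:

```python
# Illustrative sketch of the two pre-training objectives (hypothetical format).

def lm_example(text: str) -> str:
    # Self-supervised LM objective: the model predicts the text from itself,
    # so unlabeled monolingual data suffices.
    return text

def mt_example(source: str, target: str, src_lang: str, tgt_lang: str) -> str:
    # Supervised MT objective: predict the target sentence given the source,
    # which requires an aligned (source, target) pair.
    return f"[{src_lang}] {source} [{tgt_lang}] {target}"

print(mt_example("The cat sleeps.", "Le chat dort.", "en", "fr"))
# -> [en] The cat sleeps. [fr] Le chat dort.
```

Because both examples end up as plain token sequences, parallel data can be mixed directly into the pre-training stream alongside monolingual text.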
They highlighted that including cross-lingual data during pre-training not only strengthens MT capabilities but also helps bridge the gap between different languages. The researchers noted that the pre-training datasets are often dominated by English, resulting in the under-representation of other languages, particularly those with fewer resources. Incorporating aligned cross-lingual data opens up new possibilities for improving LLMs across various languages.
As they stated, “aligned cross-lingual data might enhance the abilities of LLMs across languages other than English.”
The Optimal Balance
Determining the optimal balance between self-supervision and cross-lingual supervision is challenging due to the resource-intensive nature of pre-training. To address this, the Google researchers proposed a strategy to dynamically adjust the mixing ratio between the two objectives during pre-training.
More specifically, they introduced automated curriculum learning with multi-armed bandits as an effective method for determining the optimal amount of parallel data to utilize during training.
Automated curriculum learning with multi-armed bandits is a machine learning strategy that dynamically adjusts the training distribution to optimize the learning process. It follows a sequential decision-making approach: each candidate data source or mixing policy is treated as an “arm,” and the algorithm balances exploration and exploitation to decide which arm to pull at each training step.
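A minimal sketch of this idea, using the classic EXP3 adversarial-bandit algorithm; the candidate mixing ratios, the toy reward signal, and the choice of EXP3 itself are illustrative assumptions, not the paper's exact setup:

```python
import math
import random

class Exp3:
    """Minimal EXP3 bandit. Each arm is a candidate fraction of parallel
    (MT) data per batch; the reward would come from a training signal such
    as a normalized loss decrease. Illustrative sketch only."""

    def __init__(self, n_arms: int, gamma: float = 0.1):
        self.n_arms = n_arms
        self.gamma = gamma          # exploration rate
        self.weights = [1.0] * n_arms

    def probabilities(self):
        # Mix the weight distribution with uniform exploration.
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select(self) -> int:
        return random.choices(range(self.n_arms),
                              weights=self.probabilities())[0]

    def update(self, arm: int, reward: float):
        # Importance-weighted exponential update; reward assumed in [0, 1].
        p = self.probabilities()[arm]
        self.weights[arm] *= math.exp(self.gamma * reward / (p * self.n_arms))

# Toy usage: arms are candidate fractions of parallel data per batch.
ratios = [0.0, 0.1, 0.25, 0.5]
bandit = Exp3(len(ratios))
random.seed(0)
for step in range(500):
    arm = bandit.select()
    # Hypothetical reward: pretend 25% parallel data helps training most.
    reward = 1.0 - abs(ratios[arm] - 0.25)
    bandit.update(arm, reward)

print([round(p, 3) for p in bandit.probabilities()])
```

In a real pre-training run, a single bandit decision per step replaces the grid search over static mixing ratios that would otherwise require many full training runs.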
According to the researchers, this approach offers substantial gains by eliminating the need for computationally expensive grid searches and outperforms static data sampling baselines. “When faced with learning an optimal amount of cross-lingual supervision to use, we show that automated curriculum learning is an effective strategy that does not require multiple training runs and which outperforms static policies,” they said.