On August 17, 2023, a team of researchers at Google DeepMind introduced a new method for improving the quality of large language models (LLMs) by aligning them with human preferences. To demonstrate the method's effectiveness, they selected the domain of machine translation (MT).
The proposed method, known as Reinforced Self-Training (ReST), draws inspiration from growing batch reinforcement learning (RL). The LLM first generates new synthetic training data from its own outputs; it is then fine-tuned on this data, with a reward model scoring each sample and steering the learning process.
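The loop described above can be sketched in a few lines of toy Python. This is not DeepMind's implementation; every function here (`generate`, `reward`, `fine_tune`) is a hypothetical stand-in, and the reward thresholds are illustrative. The sketch shows the core idea: a Grow step produces an offline dataset from the current policy, and successive Improve steps reuse that same dataset, filtering it with a rising reward threshold before each round of fine-tuning.

```python
def rest_training(generate, reward, fine_tune, policy, prompts,
                  grow_steps=1, thresholds=(0.5, 0.7, 0.9)):
    """Toy sketch of the ReST loop (illustrative, not the paper's code)."""
    for _ in range(grow_steps):
        # Grow: the current policy generates candidate outputs for each
        # prompt, scored once by the reward model -> an offline dataset.
        dataset = [(p, y, reward(p, y))
                   for p in prompts
                   for y in generate(policy, p)]
        # Improve: reuse the same offline dataset several times, keeping
        # only samples above an increasing reward threshold each round.
        for tau in thresholds:
            kept = [(p, y) for p, y, r in dataset if r >= tau]
            policy = fine_tune(policy, kept)
    return policy


# Hypothetical toy stand-ins: a "policy" is just the list of examples
# it has been tuned on, and the "reward" counts exclamation marks.
def generate(policy, prompt):
    return [prompt + suffix for suffix in ("!", "!!", "?")]

def reward(prompt, y):
    return y.count("!") / 2  # yields 0.0, 0.5, or 1.0

def fine_tune(policy, data):
    return policy + data

tuned = rest_training(generate, reward, fine_tune, [], ["hello"])
```

Because the dataset is generated once per Grow step and reused across Improve steps, low-reward samples are progressively filtered out, which is the source of the sample-efficiency gains the authors report.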
Why would DeepMind choose machine translation as a testbed for ReST? First, the researchers see MT as a highly “impactful application” of LLMs. MT also has “strong baselines and a well-defined evaluation procedure,” making it an ideal benchmark for assessing ReST’s effectiveness.
Furthermore, “several existing reliable scoring and evaluation methods” are available for MT, including MetricX, BLEURT, and COMET. These can serve as reward models, making it possible to evaluate ReST’s effectiveness objectively and enhancing the credibility of the research.
To ensure the versatility of their approach, the researchers tested ReST on diverse benchmark datasets — encompassing IWSLT 2014, WMT 2020, and an internal Web Domain dataset — and across different language pairs. “We selected a different language pair for each dataset to test the generality of the results,” they said.
In addition to automated metrics, the researchers conducted human evaluations to ensure that ReST aligns with human preferences. These evaluations involved human raters who assessed translations on a scale from 0 to 6, adding a qualitative dimension to the assessment.
ReST Improves Translation Quality
The results demonstrated ReST’s ability to significantly improve translation quality, as indicated by both automated metrics and human evaluation on MT benchmarks.
According to the researchers, what sets ReST apart is its efficiency. It outperforms online reinforcement learning methods in terms of sample and compute efficiency because it generates training data offline, allowing for data reuse.
As highlighted by techno-optimist and AI accelerationist Far El in a tweet, ReST represents “1 more step towards fully autonomous machines and the beginning of the end of manual finetuning”.
Beyond MT, ReST exhibits promising potential in various generative learning settings, including summarization, turn-based dialogue, and generative audio and video models, as emphasized by the authors.
This adaptability positions ReST as a versatile methodology for advancing reinforcement learning from human feedback (RLHF) across a broad spectrum of language-related tasks, they concluded.
Authors: Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, Nando de Freitas