In a paper published on November 15, 2023, researchers Miguel Moura Ramos, Patrick Fernandes, António Farinhas, and André F. T. Martins demonstrated how the quality of text generated by LLMs can be significantly improved by incorporating human feedback at various stages of the machine translation (MT) process.
More specifically, the researchers explored diverse techniques for integrating quality metrics as reward models into the MT pipeline. They highlighted the significant progress made by evaluators in developing automatic quality estimation (QE) and evaluation metrics learned from human quality annotations — such as COMET-QE, COMET, and BLEURT — and emphasized that these metrics can be repurposed as reward models.
Their experiments involved data filtering, training through reinforcement learning (RL), inference-time reranking techniques, and a combination of these methods.
The researchers underlined the novelty of their work, stating that “none of the previous work has systematically compared the effect of integrating metrics at different stages of the MT pipeline or has attempted to combine these techniques in a unified approach.”
Feedback at an Early Stage
A significant contribution of the study is the proposal of an alternative data filtering method using COMET-QE. The researchers explained that COMET-QE is an ideal preference model for data filtering, being a multilingual reference-free neural-based metric trained on human annotations of translation quality, “accurate” in QE, and with a “superior alignment with human judgments.”
The proposed method aims to curate high-quality datasets, effectively minimizing RL training instability. The researchers said that such quality-aware data filtering could “significantly increase the performance of MT systems by introducing feedback in an early stage of the pipeline.”
However, the effectiveness of this process depends on the selected metric. Using metrics not closely aligned with human judgments can result in poorly correlated and misaligned sentences, making the training process more unstable. Therefore, the use of robust QE models — such as COMET-QE or the more recent COMETKIWI model — is important.
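The filtering idea above can be sketched in a few lines: score each source–target pair with a reference-free QE model and keep only pairs above a quality threshold. The `toy_qe_score` function below is a hypothetical placeholder (a crude length-ratio heuristic), not COMET-QE or COMETKIWI; in practice it would be replaced by a call to a trained QE model.

```python
def toy_qe_score(source: str, translation: str) -> float:
    """Placeholder QE score: a real system would call a trained model
    such as COMET-QE. This toy version penalizes empty targets and
    large length mismatches, which often indicate misalignment."""
    if not translation.strip():
        return 0.0
    return min(len(source), len(translation)) / max(len(source), len(translation))

def filter_parallel_corpus(pairs, qe_score=toy_qe_score, threshold=0.5):
    """Keep only (source, target) pairs whose QE score meets the threshold."""
    return [(s, t) for s, t in pairs if qe_score(s, t) >= threshold]

corpus = [
    ("A casa é azul.", "The house is blue."),
    ("Bom dia!", ""),  # empty target: likely a misaligned pair
]
clean = filter_parallel_corpus(corpus)  # keeps only the first pair
```

The threshold is a tunable trade-off: a stricter cutoff yields a smaller but cleaner training set, which is exactly what helps stabilize the subsequent RL training.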
Additionally, the researchers noted that quality metrics can have a “pivotal role” in classic RL training by providing rewards to optimize the MT model’s performance.
The RL-based training process involves a neural machine translation (NMT) model that generates translations, which are in turn evaluated by the reward model, producing rewards that indicate translation quality. These rewards are then used by the policy gradient algorithm to refine the NMT model's policy.
In contrast to previous works that predominantly used BLEU as the reward function, this study (again) identified the limitations of BLEU. This prompted researchers to leverage robust preference models during RL training, such as the reference-based COMET and the reference-free COMET-QE. The researchers explained that by incorporating these pre-trained preference models, the RL systems can better capture nuanced user preferences by receiving human-like feedback as rewards.
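The policy-gradient loop described above can be illustrated with a minimal REINFORCE sketch. To keep it self-contained, the "policy" here is just a categorical distribution over a fixed candidate set and the reward is a toy stand-in metric (hypothetical, not COMET/COMET-QE); a real setup would instead sample translations from an NMT model and score them with the trained preference model.

```python
import math
import random

def softmax(logits):
    """Convert logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def toy_metric(candidate: str) -> float:
    """Placeholder reward model: rewards the candidate containing 'blue'.
    A real system would return a learned quality score here."""
    return 1.0 if "blue" in candidate else 0.0

def reinforce_step(logits, candidates, lr=0.5, rng=random):
    """One REINFORCE update: sample a candidate, observe its reward,
    and move the policy toward higher-reward outputs."""
    probs = softmax(logits)
    i = rng.choices(range(len(candidates)), weights=probs)[0]
    reward = toy_metric(candidates[i])
    # Gradient of log prob w.r.t. logits is one_hot(i) - probs.
    return [l + lr * reward * ((1.0 if j == i else 0.0) - p)
            for j, (l, p) in enumerate(zip(logits, probs))]

candidates = ["The house is blue.", "The house is basement."]
logits = [0.0, 0.0]
rng = random.Random(0)
for _ in range(200):
    logits = reinforce_step(logits, candidates, rng=rng)
# After training, the policy concentrates on the higher-reward candidate.
```

The key point mirrored from the paper: nothing in the update rule cares *which* metric supplies the reward, which is why BLEU can be swapped out for a neural preference model that better tracks human judgments.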
The results revealed that, in some cases, RL-based training alone did not yield significant improvements, but when combined with high-quality training datasets, it resulted in substantial enhancements.
The performance gains with COMET-QE, used as both data filter and reward model, emphasized the potential of RL-based NMT models trained with a QE reward model to outperform other RL-trained models. According to the researchers, this suggests promising opportunities for unsupervised NMT training with monolingual data — especially for low-resource languages — by eliminating the need for reference translations in evaluation and reward signal generation.
Prioritizing Human-Aligned Translations
Finally, the researchers recommended incorporating quality metrics as rerankers during the decoding phase, so that the system prioritizes and selects translations aligned with human judgments and minimizes the risk of generating inaccurate output.
“By doing this, the MT system will prioritize translations that are more aligned with human judgments, therefore reducing the chances of generating severely incorrect translations,” they said.
The researchers suggested that even if the underlying model has already undergone RL training using the same or a different preference model, incorporating preference models during the decoding stage can further improve translation quality.
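Reranking at decoding time reduces to a simple pattern: generate several candidate translations (via beam search or sampling), score each with a reference-free metric, and return the highest-scoring one. The sketch below uses a hypothetical `toy_qe_score` stand-in rather than a real QE model, and a hard-coded candidate list standing in for decoder output.

```python
def toy_qe_score(source: str, hypothesis: str) -> float:
    """Placeholder QE score: a real reranker would call a trained
    reference-free metric such as COMET-QE here."""
    if not hypothesis.strip():
        return 0.0
    return min(len(source), len(hypothesis)) / max(len(source), len(hypothesis))

def rerank(source, candidates, score=toy_qe_score):
    """Return candidates sorted best-first by the reference-free metric."""
    return sorted(candidates, key=lambda h: score(source, h), reverse=True)

src = "A casa é azul."
hyps = ["Blue.", "The house is blue.", ""]  # stand-in for N-best decoder output
best = rerank(src, hyps)[0]
```

Because the reranker only consumes candidate lists, it composes cleanly with the earlier stages: the same (or a different) preference model can rescore the output of an RL-trained system, which is exactly the combination the researchers found to yield further gains.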