The Case for Contextual Evaluation in Measuring LLM Translation Quality


At SlatorCon Zurich, Dr. Sheila Castilho, Assistant Professor at Dublin City University, spoke about how contextual evaluation helps assess the capabilities and limitations of large language models (LLMs).

Dr. Castilho began by providing historical context, reflecting on the excitement surrounding neural machine translation (NMT) systems in 2015-2016. She explained that during this period, media headlines celebrated NMT as “as good as human.” This hype gave rise to the term “human parity” in reference to NMT systems, causing concern among human translators.

But the initial hype faded at the beginning of 2018, as extensive analysis revealed that “human parity” was hard to achieve. The evaluation methods used during that period were found to be flawed, with problems relating to the quality of reference translations and the absence of contextual information in evaluations. These shortcomings led Dr. Castilho to investigate contextual evaluation, a field she has been researching for several years.

Dr. Castilho’s talk then put the spotlight on the current excitement around artificial intelligence (AI), and LLMs in particular. Despite the promise of LLMs, Dr. Castilho expressed concerns about the current state of evaluation methods. She noted that many recent research papers still rely on traditional metrics like BLEU scores, which were developed decades ago and may not be suitable for assessing the nuanced outputs of modern LLMs.

“Why are we still stuck in evaluating the quality of translation at the segment level or with automatic metrics only? There is absolutely no reason that we’re still doing that. The errors are different, the quality is different. Why is the evaluation the same still?” she said.

Rigorous Evaluation

Dr. Castilho presented evidence that evaluating translations solely at the sentence level is often inadequate, and she argued for more rigorous assessment methods to match the complexity of LLMs.

“Given the sophistication of these models, it only stands to reason that we need to conduct a more rigorous evaluation,” she said. “If you are making extraordinary claims, you must provide extraordinary evidence to support them,” she added.

According to Dr. Castilho, contextual evaluation has the potential to reveal translation issues that sentence-level evaluation might miss. During her talk, she presented an example of gender bias in translation to illustrate how context-aware evaluation can help uncover issues. 
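To make the gender-bias example concrete, here is a minimal sketch, not drawn from Dr. Castilho’s slides, that uses the open-source sacrebleu Python library to score two invented Portuguese translations of the same English segment. The disambiguating context sentence is exactly what a segment-level metric never sees.

```python
# Minimal sketch (illustrative only, not Dr. Castilho's evaluation protocol):
# why a segment-level metric such as BLEU can miss a context-dependent gender error.
# Assumes the open-source `sacrebleu` package; all sentences are invented.
import sacrebleu

# The preceding sentence in the document establishes that the doctor is a woman,
# but a segment-level metric never sees this context.
context = "Maria is a doctor at the local hospital."
source = "The doctor was exhausted after the night shift."
print(f"Context: {context}\nSource:  {source}\n")

# Reference produced with the document context in mind (feminine forms).
reference = "A médica estava exausta depois do plantão noturno."

hypotheses = {
    "context-aware": "A médica estava exausta depois do plantão noturno.",
    "context-blind": "O médico estava exausto depois do plantão noturno.",  # defaults to masculine
}

for name, hyp in hypotheses.items():
    score = sacrebleu.sentence_bleu(hyp, [reference]).score
    print(f"{name}: BLEU = {score:.1f}")

# The context-blind output still earns a respectable overlap score, and if the
# reference itself had been translated sentence by sentence without context
# (one of the flaws identified in the "human parity" studies), the biased output
# could even match it exactly. Only an evaluation that gives raters or metrics
# the surrounding context can reliably flag the error.
```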

She also highlighted the need for responsible adoption of LLMs and the importance of comprehensive evaluation to uncover their capabilities and limitations. Dr. Castilho urged the audience to consider the ethical implications of AI-driven evaluation and AI systems certifying other AIs, while also addressing the challenges of diversity and data recycling in AI development.

“Why are we still stuck in evaluating the quality of translation at the segment level or with automatic metrics only?” – Dr. Sheila Castilho, Assistant Professor, Dublin City University

The talk wrapped up with an interactive session where the audience posed questions, addressing topics such as the adoption of improved evaluation metrics and the potential leaders in driving change in the field of AI translation.

For those who could not attend in person, SlatorCon Zurich 2023 recordings will be available in due course via our Pro and Enterprise plans.