In a panel discussion at SlatorCon Remote September 2022, moderated by Slator Research Analyst Maria Stasimioti, Adam Bittlingmayer, CEO of ModelFront, and Conchita Laguardia, Senior Technical Program Manager at Citrix, shared their insights into machine translation (MT) quality estimation (QE).
The discussion started with Bittlingmayer defining quality estimation as a prediction of whether a translation is good or not. “It takes a source segment and a target segment, and returns some score,” Bittlingmayer said.
Bittlingmayer also noted that quality estimation differs from automatic evaluation metrics such as BLEU because it does not require a human reference translation. QE works with machine learning (ML), so it’s “not for offline scoring on work that’s already been done, but for new, live content,” he said.
QE can predict quality at various levels of text, including at the level of the word, phrase, sentence, or even document. “It is very flexible,” Laguardia pointed out. At Citrix, QE at segment and document level has proven useful.
Explaining what QE is used for, Bittlingmayer said, “What everybody wants to do is say if a raw MT output is good or bad so that they don’t have to use human translation for that.” ModelFront refers to this use of quality estimation as the hybrid translation workflow – part human, part machine – because some segments were never seen by a human.
In the hybrid translation workflow, “you first hit the translation memory (TM) and all 100% matches skip over human review, as usual. Then you hit a machine translation API, and then you hit the quality prediction API,” Bittlingmayer said.
A quality threshold is set up front and each new MT segment is automatically classified as high-quality or low-quality. A high-quality MT segment is machine-approved as is and ready to be published, similar to a 100% TM match. Low-quality MT, on the other hand, is sent for human post-editing, as before.
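The routing logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ModelFront's or Citrix's implementation: the names `route_segment`, `fake_translate`, and `fake_quality` are hypothetical, the two stub functions stand in for real MT and quality prediction APIs, and the 0.9 threshold is an arbitrary placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class RoutedSegment:
    source: str
    target: str
    route: str  # "tm_match", "machine_approved", or "post_edit"
    score: Optional[float] = None  # None when the TM supplied the target


def route_segment(
    source: str,
    tm: Dict[str, str],
    translate: Callable[[str], str],
    estimate_quality: Callable[[str, str], float],
    threshold: float = 0.9,  # assumed value; set up front per project
) -> RoutedSegment:
    # Step 1: a 100% TM match skips human review, as usual.
    if source in tm:
        return RoutedSegment(source, tm[source], "tm_match")
    # Step 2: no exact match, so call the MT engine.
    target = translate(source)
    # Step 3: score the (source, target) pair; no reference translation needed.
    score = estimate_quality(source, target)
    # Step 4: compare against the threshold to approve or send to post-editing.
    route = "machine_approved" if score >= threshold else "post_edit"
    return RoutedSegment(source, target, route, score)


# Hypothetical stand-ins for real MT and QE services (assumptions for the demo).
def fake_translate(source: str) -> str:
    return source.upper()


def fake_quality(source: str, target: str) -> float:
    # Toy rule: pretend short segments are translated reliably.
    return 0.95 if len(source.split()) <= 3 else 0.40


tm = {"Hello": "Hola"}
for text in ["Hello", "Save file", "Please restart the server now"]:
    result = route_segment(text, tm, fake_translate, fake_quality)
    print(f"{text!r} -> {result.route}")
```

In a real deployment, the two callables would wrap the MT and quality prediction APIs, and the threshold would be tuned per language pair and content type.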
As for Citrix, Laguardia said they found that “as much as 85% of the segments in some languages were left untouched during post-editing.” She said they use QE to predict the quality of these segments so that they can be taken out of the light post-editing process and skip human review.
Benefits and Challenges
According to Laguardia, the main benefit QE gives Citrix at the moment is the ability to release content faster and more often — “our main motivation,” she said.
She added, “We believe that localization value comes from offering the right balance between quality and velocity,” [because it is no use] “if you offer high quality content, but it comes three months later.”
Aside from speed and price — two of the oft-cited benefits of QE — Bittlingmayer said there are ways to improve quality with QE as well since “there are cases where these [QE] models can catch things that humans may have missed.”
Adopting QE, however, requires a mindset shift. As Laguardia pointed out, “The traditional way of looking at quality management is always very language based. What you’re asking [the industry] now is actually to trust a machine to tell where the MT has gone wrong.”
In addition, the translation management system (TMS) poses an extra challenge. Laguardia said, “The TMS is still at the center of a workflow. [But] if you want adoption and you want to gain confidence in the system, you need to have transparency and visibility in the TMS. I find the TMS support for anything that is not just calling the MT engine is a little bit still in its infancy.”
MTQE Adoption and Outlook
Most commercial QE offerings are not available as standalone products but are instead included as part of a broader language technology or service offering. Examples include OpenKiwi from Unbabel, KantanQES from KantanAI, Omniscien’s translation confidence scoring and quality estimates, and Memsource’s MT Quality Estimation (MTQE).
One notable exception, according to Bittlingmayer, is ModelFront’s standalone production system for QE, which is independent of any TMS, MT engine, or LSP.
Bittlingmayer said the adoption of MTQE “probably rounds to about 0% or 0.1%,” noting that its use is currently limited to the biggest buyers and producers of translation. For MTQE adoption to become more widespread, Laguardia and Bittlingmayer agreed, ready-to-use integrations in TMS and CAT tools need to be commercially available.
“We can access MT engines, but we are left on our own to actually manage what comes out from them. I think building more tools that allow for more MT-centric workflows will help massively with adoption,” Laguardia said.
“Like with MT, what’s accessible to teams with developers is years ahead of what’s readily accessible in tools,” Bittlingmayer added.