Until now, interpretation has remained a profession relatively unchanged by software advances, largely because it is a nuanced art form that is, as one professional interpreter says, “as much interpersonal as it is linguistic”.
While much has been said about translation productivity tools, machine translation (MT) and measuring MT quality, comparatively little research has examined how similar quality estimation (QE) methods might be applied to interpreting.
A group of researchers in the US, led by Jordan Boyd-Graber of the University of Maryland and Graham Neubig of Carnegie Mellon University, has just published a paper titled Automatic Estimation of Simultaneous Interpreter Performance. Here, we outline the findings of the paper and examine its implications. We also include a Q&A with the authors and comment from a professional interpreter.
Building A Real-Time Quality Feedback Loop
Computer-assisted interpreting (CAI) tools already exist and work by providing terminology hints during interpreting assignments. They have had mixed results, since there has been no reliable way to determine how much information to give an interpreter, and when.
Too much or irrelevant information is distracting and can interfere with the cognitive process, hindering the performance of the interpreter.
To get around this, the team wanted to develop a way to assess an interpreter’s performance in real time, i.e. to make it possible to tell whether an interpreter could benefit from software assistance at that moment. If so, the tool would offer terminology hints; if not, it would stay silent.
Generating a reliable real-time performance assessment is key to making software assistance for interpreters selective and therefore more useful, and to improving the quality of simultaneous interpreting (SI) output.
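As a rough illustration of how such a selective feedback loop might be wired together, consider the sketch below. It is not taken from the paper; the threshold, function names and interfaces are all assumptions made for the example.

```python
# A minimal sketch of the selective-assistance loop (hypothetical names and
# threshold throughout; this is not the paper's implementation).

QUALITY_THRESHOLD = 0.5  # assumed tuning parameter


def assist_interpreter(segments, estimate_quality, lookup_terminology):
    """Yield terminology hints only for segments the QE model flags as weak.

    `segments` is an iterable of (source_text, interpreted_text) pairs;
    `estimate_quality` and `lookup_terminology` stand in for the real-time
    QE model and the CAI terminology component.
    """
    for source_text, interpreted_text in segments:
        score = estimate_quality(source_text, interpreted_text)
        if score < QUALITY_THRESHOLD:
            # Estimated quality is low: the interpreter may be struggling,
            # so surface focused terminology help for this segment.
            yield lookup_terminology(source_text)
        else:
            # Quality looks fine: stay out of the interpreter's way.
            yield None
```

In practice, the threshold and the form of assistance would have to be tuned against interpreter feedback, which is exactly what a reliable real-time QE signal is meant to enable.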
To do that, it was first necessary to develop a quality estimation (QE) method based on samples of SI output that could then be applied to a real-time performance scenario.
Developing Quality Estimation Methods
Simply reusing existing machine translation (MT) QE methods is not enough, because the cues that indicate the quality of SI output differ from those that indicate good MT output. For example, an interpreter who is struggling may produce a choppy translation with many pauses, or drop large portions of content they were not able to interpret properly.
Despite this, existing MT QE methods were considered a good starting point, and the team of researchers decided to use an MT QE model, called QuEst++, as a baseline. They then customized QuEst++ for use on SI output by adding a number of features that are specific indicators of interpreter performance (a sketch of how such features might be computed follows the list below):
1) the ratio of pauses/hesitations/incomplete words – because “an increased number of hesitations or incomplete words in interpreter output might indicate that an interpreter is struggling to produce accurate output.”
2) the ratio of non-specific words – because “Interpreters often compress output by replacing or omitting common nouns to avoid specific terminology either to prevent redundancy or to ease cognitive load.”
3) the ratio of ‘quasi-’cognates – because “transliterated words in interpreted speech could represent facilitated translation by language proximity, or an attempt to produce an approximation of a word that the interpreter did not know.”
4) the ratio of the number of words – to “compare source and target length and the amount of transcribed punctuation.”
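To make the feature list above concrete, here is a simplified sketch of how ratios of this kind might be computed from one transcribed segment. The hesitation markers, the vague-word list and the prefix-based cognate test are crude stand-ins invented for illustration; the paper defines its features over transcripts with explicit pause and incomplete-word markup.

```python
# Simplified SI-specific feature extraction for a single segment.
# All marker sets and the cognate heuristic are illustrative assumptions.

HESITATION_MARKERS = {"uh", "um", "%pause%"}          # assumed transcript markup
NON_SPECIFIC_WORDS = {"thing", "stuff", "something"}  # assumed vague-word list


def is_quasi_cognate(src_word: str, tgt_word: str, prefix_len: int = 4) -> bool:
    """Very rough cognate test: do the words share a prefix across languages?"""
    return src_word[:prefix_len].lower() == tgt_word[:prefix_len].lower()


def si_features(source_tokens: list[str], target_tokens: list[str]) -> dict:
    n_src, n_tgt = len(source_tokens), len(target_tokens)
    # Incomplete words are assumed to be transcribed with a trailing hyphen.
    pauses = sum(t in HESITATION_MARKERS or t.endswith("-") for t in target_tokens)
    vague = sum(t.lower() in NON_SPECIFIC_WORDS for t in target_tokens)
    cognates = sum(
        any(is_quasi_cognate(s, t) for s in source_tokens) for t in target_tokens
    )
    return {
        "pause_hesitation_ratio": pauses / max(n_tgt, 1),
        "non_specific_word_ratio": vague / max(n_tgt, 1),
        "quasi_cognate_ratio": cognates / max(n_tgt, 1),
        "target_source_length_ratio": n_tgt / max(n_src, 1),
    }
```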
The researchers set about testing the validity of these additional features in a series of experiments using English-French, English-Italian and English-Japanese interpreter output. They found that their SI QE model achieved statistically significant gains over the baseline MT QE model across all language pairs, and it was particularly effective for English-Japanese.
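The experiments can be framed, roughly, as a regression problem: predict a per-segment METEOR score from the QE features, then judge the model by how well its predictions correlate with the true scores on held-out data. The sketch below uses support vector regression, a common learner for sentence-level QE; the function and data names are placeholders, not the authors’ exact pipeline.

```python
# Sketch of a QE evaluation loop: fit a regressor on feature vectors
# (baseline + SI-specific) to predict METEOR, report Pearson correlation.

import numpy as np
from scipy.stats import pearsonr
from sklearn.svm import SVR


def evaluate_qe(train_features, train_meteor, test_features, test_meteor):
    """Train on (features, METEOR) pairs and return Pearson r on the test set."""
    model = SVR()  # support vector regression, a common choice for sentence-level QE
    model.fit(np.asarray(train_features), np.asarray(train_meteor))
    predictions = model.predict(np.asarray(test_features))
    r, _ = pearsonr(predictions, np.asarray(test_meteor))
    return r
```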
What Next?
Based on these results, the team proposed that the SI QE method could be used immediately to refine the performance of existing interpreter assistance software in order to improve SI output. They also suggested that future research work could explore what other combinations of features could help to tweak the model to be more attuned to the particularities of interpreter performance.
Slator reached out to Graham Neubig, Assistant Professor in the Language Technologies Institute, Carnegie Mellon University (CMU), with a number of questions about the paper. We received the following responses from Neubig and Craig Stewart, a graduate student in the Language Technologies program at CMU whom Neubig credits as the “main driving force behind this paper”.
Q. What led to the research being done? Why this topic in particular?
A. Simultaneous interpretation is an extremely difficult job that requires the interpreter to both listen to the speaker and produce output simultaneously, requiring careful concentration and having very little margin for error. We are currently pursuing a project on Computer Assisted Interpretation (CAI) interfaces, a collaboration between Carnegie Mellon University, University of Maryland, and University of Washington, funded by the U.S. National Science Foundation. These interfaces will provide automated assistance to interpreters in an attempt to make their job easier and make their interpretation more precise. The basic idea is that we would like to give them real-time assistance with numbers, terminology, or other difficult-to-interpret content while they are interpreting.
However, putting a screen in front of interpreters is tricky. Interpreters already have an extremely difficult job and as much as the ultimate goal of CAI is to make this job easier, it is not difficult to imagine that such an interface could easily become distracting and do more harm than it does good. The goal of the work described in the current paper is to create tools that would allow us to assess when interpreters might be struggling with their interpretation, and provide focused assistance only when it is necessary, adjusting to the interpreter’s needs and minimizing unnecessary distraction.
Finding a way to maximize the utility of CAI and minimize disruption to interpreter workflow seems like an interesting and important challenge to us.
Q. What next? You conclude the paper by saying “creation of fine-grained measures to evaluate various aspects of interpreter performance is an interesting avenue for future work.” Is this something that is in the pipeline?
A. Currently in our work, we are using a metric called METEOR, which was designed for assessing the accuracy of machine translation systems, as a proxy for measuring interpreter performance. This metric is relatively sophisticated, taking into account the importance of the words in the sentence and taking into account the fact that there are multiple ways to translate a particular sentence. However, it is not directly aligned with the types of metrics that are used to evaluate interpreters, for example in a pedagogical setting. If we can come up with better ways of figuring out when an interpreter needs assistance in a way that is more aligned with the interpreter strategy and workflow, then we feel this would be more directly aligned with our final goal of creating good CAI systems. One thing that we are currently looking at is examining not just the content that the interpreter uttered, but also how they uttered it (e.g. is the speech smooth or choppy), to see whether this information can better inform the system.
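For readers who want to see what the METEOR proxy looks like in practice, the snippet below scores an invented interpreted segment against a reference translation using NLTK’s implementation. The sentences are made up for illustration, and recent NLTK versions expect pre-tokenized input plus the WordNet data.

```python
# Scoring a (made-up) interpreted segment against a reference translation
# with NLTK's METEOR implementation.

import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)

reference = "the committee approved the budget for next year".split()
interpreted = "the committee approved next year's budget".split()

print(f"METEOR: {meteor_score([reference], interpreted):.3f}")
```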
Q. How has the research been received so far?
A. With respect to the reception of the natural language processing community, the paper was well received: it has been accepted at the annual meeting of the Association for Computational Linguistics (ACL), which is the flagship publication venue in our field. The paper will be presented at the conference in Melbourne in July.
For the current broader project on CAI in general, we have been discussing with interpreter collaborators at the interpretation schools of University of Maryland and University of Bologna. We haven’t yet had any direct feedback on the method described in this paper, but our discussions with them have indicated that distraction is a primary concern for them when thinking about CAI systems.
We are currently preparing for trials where we elicit feedback from interpreters on the usefulness of the system on a larger scale. If any of the Slator readers find this topic interesting and would be interested in seeing a trial of the system and giving feedback, we would be delighted to work with them.
Q. What is the likelihood of you deciding to commercialize the QE metric for SI?
A. We have no concrete plans to commercialize this at the moment, but if there are any parties interested in commercialization, we would be happy to discuss details with them.
For a practitioner’s perspective, Slator reached out to Dr Jonathan Downie AITI for his take on CAI and quality metrics. Downie is an interpreter whose PhD examined client expectations of interpreters, and he co-hosts the Troublesome Terps podcast. This is what he told us:
“Building Computer-Assisted Interpreting systems is in everyone’s interests. Anything that can reduce the cognitive effort of jobs like term searching, number management and working with names is good news.”
“Measuring interpreting quality is an incredibly complex problem, all the more so since research from the mid-1990s has shown that interpreting is as much interpersonal as it is linguistic. The words and prosody can be perfect and yet the interpreting can still flop and the meeting could crash. Conversely, the interpreter could have a choppy, hesitant day but the client could love it and the meeting could go well. Their simplistic view of quality and interpreting itself means that their tool is interesting but more theoretically useful than practically useful.”
“There is real potential for CAI tools but they need to be interpreter-driven, rather than using automated quality metrics that might not be measuring quality in the same way as clients and users do. My advice to the developers would be to create software that allows interpreters to tune the amount and kind of help they receive on-the-fly and to work to get the UX and UI clean and quick, perhaps even using transparent screens, as featured on the flight-decks of some aircraft, so the interpreters can still see the speakers.”
“Smart technology with interpreters in the driving-seat is the way to go.”
About The Authors
The paper is co-authored by a group of students and professors:
- Jordan Boyd-Graber – Associate Professor in the University of Maryland Computer Science Department (tenure home), Institute of Advanced Computer Studies, iSchool, and Language Science Center.
- Junjie Hu – Ph.D. student in Language and Information Technologies at the Language Technologies Institute, Carnegie Mellon University
- Graham Neubig – Assistant Professor in the Language Technologies Institute, Carnegie Mellon University
- Craig Stewart – Master of Language Technologies student at the Language Technologies Institute, Carnegie Mellon University
- Nikolai Vogler – Master of Language Technologies student at the Language Technologies Institute, Carnegie Mellon University
Neubig’s research is concerned with language and its role in human communication, with a particular focus on breaking down barriers in human-human or human-machine communication through the development of natural language processing (NLP) technologies.
Boyd-Graber focuses on making machine learning more useful, more interpretable, and able to learn and interact from humans.