In Tech-Assisted Interpreting, the Tricky Problem of Non-English ‘Person Names’

Interpreters and their clients have embraced telephonic (OPI) and virtual (VRI) modes in cases where on-site interpreting is not an option, especially since the pandemic. Computer-assisted interpreting (or whichever term we eventually settle on), on the other hand, has a long way to go before mainstream adoption.

New research has zeroed in on “person names” as a specific hurdle for both automatic speech recognition (ASR) and speech translation (ST), one that is holding the technology back from wider adoption. According to the research, ASR and ST systems transcribe or translate personal names correctly only about 40% of the time.

Metrics for evaluating speech translation quality are currently somewhat “insensitive” to errors in personal names and numbers, even though humans treat these as very important content. Like machines, humans also struggle with unfamiliar names in languages they have not learned, so interpreters might be interested in using computer-assisted interpreting exclusively for help with this task.
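
To make the “insensitivity” point concrete, here is a minimal sketch in Python. It assumes word error rate (WER) as the quality metric, since WER is a common choice for ASR; the article does not say which metrics the researchers had in mind. The sentence, the helper names `word_error_rate` and `name_accuracy`, and the substring-based name check are all illustrative assumptions, not the paper's evaluation method.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the words seen so far in ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def name_accuracy(names: list[str], hypothesis: str) -> float:
    """Share of reference names that appear verbatim in the hypothesis.

    A simple case-insensitive substring check, used here only for illustration.
    """
    if not names:
        raise ValueError("names must not be empty")
    hyp = hypothesis.lower()
    return sum(name.lower() in hyp for name in names) / len(names)


# Invented example: the only mistake in the transcript is the person's name.
reference = "the committee thanked matteo negri for his report on the new budget"
hypothesis = "the committee thanked mateo negro for his report on the new budget"

print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
print(f"Name accuracy: {name_accuracy(['Matteo Negri'], hypothesis):.2f}")
```

In this toy case only two of the twelve reference words are wrong, so the sentence-level score looks respectable, yet the one name an interpreter would have needed is missed entirely.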

Fondazione Bruno Kessler researchers Matteo Negri, Marco Turchi, and Marco Gaido (who is also affiliated with Italy’s University of Trento) explain in their May 2022 paper, Who Are We Talking About? Handling Person Names in Speech Translation, that personal name errors typically stem from names that appear infrequently in training data, as well as from a lack of training data in the language of the “referent name.”

ASR and ST models trained on English audio try to force every sound to match English words, which can distort personal names from other languages.

Generalize and Disambiguate

“Current solutions rely on predefined dictionaries to identify and translate the elements of interest,” the authors wrote, an approach that prevents these solutions from generalizing and from disambiguating homophones or homonyms.

In order to be useful, an ST or ASR system would need to “reliably recognize and translate [named entities] and terms, without generating wrong suggestions,” the authors explained.

With the long-term goal of integrating ST models into assistant tools for live interpreting, the group created multilingual models, trained with audio in different languages, to produce transcripts and translations into Spanish, French, and Italian.

Even though 80% of the total training data was in English, adding audio from another language to the corpus improved the handling of personal names in that language by 48% on average, producing useful translations for interpreters in 66% of cases.

In addition to incorporating data from other languages, the researchers also added the referent names, finding that the more frequently a name appeared, the more likely the system would transcribe the name correctly.

The study noted, “On average, names occurring at least three times in the training set are correctly generated in slightly more than 50% of the cases, a much larger value compared to those with less than three occurrences.”
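
The frequency effect the study describes can be checked with a simple bucketed tally. The sketch below is illustrative only: the function name `accuracy_by_frequency`, the bucket labels, and the toy names and counts are invented for the example and are not the researchers' code or data. Only the threshold of three occurrences comes from the quote above.

```python
from collections import defaultdict


def accuracy_by_frequency(results, train_counts, threshold=3):
    """Group per-name outcomes by how often the name occurred in training.

    results: iterable of (name, was_generated_correctly) pairs from a test set.
    train_counts: mapping from name to its number of training occurrences.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for name, correct in results:
        seen = train_counts.get(name, 0)  # unseen names count as zero
        bucket = "frequent" if seen >= threshold else "rare"
        totals[bucket][0] += int(correct)
        totals[bucket][1] += 1
    return {bucket: c / t for bucket, (c, t) in totals.items()}


# Invented toy data, for illustration only.
train_counts = {"Anna": 5, "Marco": 3, "Zbigniew": 1}
results = [
    ("Anna", True),
    ("Anna", False),
    ("Marco", True),
    ("Zbigniew", False),
    ("Okonkwo", False),  # never seen in training
]

acc = accuracy_by_frequency(results, train_counts)
for bucket, value in sorted(acc.items()):
    print(f"{bucket}: {value:.2f}")
```

Splitting at a single threshold keeps the comparison easy to read. A finer breakdown (for example, one bucket per occurrence count) would show whether accuracy keeps rising beyond three occurrences.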

Still, confusing or distracting transcriptions of personal names accounted for 15% of the results, leaving room for future research to examine what level of accuracy interpreters would need during live assignments and how to attain it.