“Automatic speech recognition (ASR) underwent a revolution after the advent of deep neural networks,” said Giuseppe Daniele Falavigna, Senior Researcher at Fondazione Bruno Kessler (FBK), and Marco Turchi, Head of the Machine Translation group at FBK.
ASR refers to technologies built to process human speech and convert it into text. The first instance of speech recognition dates back to 1952, when a single-speaker digit recognizer called “Audrey” was invented at Bell Laboratories.
A decade later, in 1962, IBM introduced “Shoebox,” a speech recognition system able to recognize 16 different words. The next breakthrough came in the mid-1970s with Hidden Markov Models (HMMs). HMMs use probability functions to determine the most likely words to transcribe and have been applied successfully to ASR for many years.
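To make the idea of using probabilities to pick the most likely sequence more concrete, here is a minimal sketch of Viterbi decoding, the standard algorithm for finding the most probable hidden-state path in an HMM. The states, observations, and all probability values below are invented for illustration; a real recognizer works with acoustic feature vectors and far larger models.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for the observations."""
    # Each column maps a state to (best log-probability so far, previous state).
    columns = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Best predecessor for state s at this time step.
            score, prev = max(
                (columns[-1][p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        columns.append(column)

    # Start from the best final state and follow the back-pointers.
    state = max(states, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Toy model (all values are hypothetical).
STATES = ("vowel", "consonant")
START_P = {"vowel": 0.5, "consonant": 0.5}
TRANS_P = {
    "vowel": {"vowel": 0.3, "consonant": 0.7},
    "consonant": {"vowel": 0.7, "consonant": 0.3},
}
EMIT_P = {
    "vowel": {"loud": 0.8, "quiet": 0.2},
    "consonant": {"loud": 0.3, "quiet": 0.7},
}

decoded = viterbi(("loud", "quiet", "loud"), STATES, START_P, TRANS_P, EMIT_P)
print(decoded)  # ['vowel', 'consonant', 'vowel']
```

The decoder never sees the hidden states directly; it infers them from what it observes plus how likely each state is to follow another, which is the same principle an HMM-based recognizer applies at much larger scale.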
Since then, ASR has developed rapidly. Today, there are two main approaches to ASR: a traditional hybrid approach and an end-to-end, deep-learning approach. In the traditional hybrid approach, the ASR system usually consists of separately trained acoustic, pronunciation, and language model components.
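One way to picture how separately trained components cooperate in a hybrid system is as a scoring step: each candidate transcript gets an acoustic score and a language model score, and the decoder keeps the candidate with the best combined score. The sketch below is a simplified illustration under that assumption. The candidate phrases, the probability values, and the `lm_weight` parameter are hypothetical, and the pronunciation model is left out for brevity.

```python
import math

# Hypothetical scores for two similar-sounding candidates.
# The acoustic model slightly prefers the wrong one.
ACOUSTIC_LOGP = {
    "their results": math.log(0.40),
    "there results": math.log(0.42),
}
# The language model strongly prefers the grammatical phrase.
LANGUAGE_LOGP = {
    "their results": math.log(0.010),
    "there results": math.log(0.0005),
}

def pick_transcript(acoustic_logp, language_logp, lm_weight=1.0):
    """Return the candidate with the highest combined log score."""
    def total(candidate):
        return acoustic_logp[candidate] + lm_weight * language_logp[candidate]
    return max(acoustic_logp, key=total)

with_lm = pick_transcript(ACOUSTIC_LOGP, LANGUAGE_LOGP)
acoustic_only = pick_transcript(ACOUSTIC_LOGP, LANGUAGE_LOGP, lm_weight=0.0)
print(with_lm)        # their results
print(acoustic_only)  # there results
```

Setting `lm_weight` to zero shows what happens when the acoustic evidence is used alone: the candidate that merely sounds marginally closer wins, even though it is the less plausible phrase.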
In the end-to-end approach, the system is capable of learning the acoustic, language, and pronunciation information contained in the speech, and can directly map a sequence of input acoustic features into a sequence of words.
Many people now use ASR on a daily basis to perform voice-search queries, send text messages, or interact with voice assistants. Moreover, ASR technology has been proposed as a tool to augment professional performance in translation, post-editing, interpreting, and subtitling.
Increased Productivity for Translators
“Computer-Assisted Translation (CAT) tools, for the most part, are based on the traditional input modes of keyboard and mouse,” according to a recent study. However, given the potential of ASR to increase productivity during the translation process, commercially available CAT tools have started offering integration with ASR systems, such as memoQ combined with Apple’s speech recognition service or Matecat combined with Google Voice. With other CAT tools, it is possible to use commercial ASR systems for dictation, such as Dragon NaturallySpeaking.
Dragoș Ciobanu, Professor of Computational Terminology and Machine Translation at the University of Vienna, reported on the successful use of ASR by freelance translators — especially when not combined with translation memory software.
According to the same study, translators are more productive when using ASR. Some of them benefit from ASR by being able to translate faster, while others benefit from it by being able to perform other jobs more quickly, such as searching the web or drafting emails. To achieve maximum productivity gains, many translators continue to use the CAT software’s keyboard shortcuts alongside ASR.
Apart from actual productivity gains (with ASR, text input speed increases from 40 to as many as 150 words per minute), Ciobanu highlighted some additional benefits of using ASR in translation. According to his study, ASR allows for “more flexible, translator-centered, ergonomic workflows and workspaces.”
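As a rough worked example of what those speeds could mean in practice, the sketch below compares the time needed to enter a text at 40 and at 150 words per minute. The 3,000-word job size is a hypothetical figure, and the calculation deliberately ignores time spent thinking, researching, and correcting recognition errors, so it shows an upper bound on the gain rather than a measured result.

```python
def drafting_minutes(word_count, words_per_minute):
    """Minutes needed to enter word_count words at a constant input speed."""
    return word_count / words_per_minute

WORD_COUNT = 3000  # hypothetical job size, not taken from the study

typed = drafting_minutes(WORD_COUNT, 40)      # typing speed cited above
dictated = drafting_minutes(WORD_COUNT, 150)  # upper ASR speed cited above

print(typed)             # 75.0
print(dictated)          # 20.0
print(typed - dictated)  # 55.0
```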
Translators spend many hours every day in front of their computers and can suffer from repetitive strain injury (RSI), eye strain, back pain, and shoulder and neck pain. ASR has strong potential to alleviate these conditions.
Working with ASR is also likely to reduce the literalness that extensive use of CAT tools and machine translation can bring about. Finally, ASR opens up possibilities for blind and visually-impaired individuals to work in the translation industry.
Challenges of Decoding Human Speech
Decoding human speech is not an easy task, mainly due to the following:
- Homophones (different words that sound the same, and require more than just the sound alone to understand)
- Code-switching (rapidly switching between dialects or languages, which is extremely common in normal human conversation around the world)
- Variability in the volume, speed or quality of someone’s voice
- Ambient sounds (e.g., echoes, road noise)
- Transfer influences from one’s first language(s) to second languages
- Other conversational devices we use, such as elisions (skipping sounds within words to say them more easily), or repair (making a small error and going back to correct it)
- Paralinguistic features (pace, tone, intonation)
Because of these difficulties, translators need to check their ASR-generated translations much more closely than their typed ones. According to research, a certain level of translation experience is also required before ASR can be successfully integrated into a translator’s practice.
Speech Technologies and Post-Editing
More recently, the potential of using ASR for post-editing purposes has also been investigated.
According to a recent study, “using speech instead of typing can speed up the work of the translator” — even in the context of post-editing.
A study of how integrating machine translation post-editing with speech technologies affects productivity and translators’ experience found that post-editing with the aid of a speech recognition system was faster than translating with the aid of a speech recognition system, and also less tiring (i.e., more ergonomic).
Similarly, another study that looked into the possibility of using speech technologies for post-editing in the context of international organizations revealed that translators working there were open to trying speech-based post-editing as a new translation workflow.
Finally, another study found voice input to be more interesting than typing alone for post-editing, not only because some segments may require major changes (and could, thus, be dictated), but also because, if the post-editor is not a touch-typist, the visual attention back-and-forth between source text, machine translation output, and keyboard adds to the task’s complexity.
Therefore, voice input adds another dimension to the post-editing task, allowing translators to combine or alternate between different input modes depending on the task difficulty and the changing conditions of human-computer interaction.