Is Transcription the Canary in the Translation Gold Mine?

Transcription and written translation have much in common (both convert language input of one form or another into text) but also differ (transcription typically involves only one language). From a supply-chain point of view, the two activities are similar in that both allow work to be sent across the globe instantly.

Arguably, transcription is the less complex activity of the two. While there is such a thing as a factually perfect transcription (i.e., everything said was perfectly transcribed), a translation, by default, is an interpretation of the source content and, therefore, open to debate.

As progress in machine learning and artificial intelligence accelerates, one would expect its impact on the human workforce to be experienced first in transcription.

Exactly three weeks after Slator broke the story on Google’s “nearly indistinguishable” claim at the launch of the new, neural-network-powered Google Translate, Microsoft claimed a “historic achievement” on its official blog: speech recognition technology that “recognizes the words in a conversation as well as a person does.”

The subject of the blog article was a paper published the day prior, on October 17, 2016, entitled “Achieving Human Parity in Conversational Speech Recognition” and authored by the Microsoft Speech & Dialogue research group. In it, Microsoft’s researchers say they had achieved a word error rate of 5.9%, an improvement from the 6.3% the team reported a month before. Said Microsoft Engineer and Chief Speech Scientist Xuedong Huang, “We’ve reached human parity. This is an historic [sic] achievement.”

The word error rate (WER) is a common metric for evaluating speech recognition, much as the BLEU score is for machine translation. The blog article went on to point out that a 5.9% error rate is “about equal to that of people who were asked to transcribe the same conversation.” The goal, of course, is to approach zero, that is, 100% accuracy.
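For readers unfamiliar with the metric, WER is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the system's output into a reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that textbook definition; it is not the scoring setup Microsoft used. The function name, the whitespace tokenization, and the lowercasing are simplifying assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumptions for this sketch: words are split on whitespace and
    compared case-insensitively; punctuation is not normalized.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from output (deletion)
                curr[j - 1] + 1,     # extra word in output (insertion)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one dropped "the":
    # two errors over six reference words.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.1%}")
```

On this definition, a 5.9% WER corresponds to roughly 59 word errors per 1,000 words of reference transcript. Because insertions count as errors, WER can exceed 100%, so treating accuracy as simply one minus WER is a common shorthand rather than a strict identity.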

Just 10 days before Microsoft published its paper, PC World Senior Editor Mark Hachman had called Microsoft’s speech recognition the “weakness no one mentions” and said his Windows Speech Recognition test drive yielded a 6.4% word error rate, which was “pretty bad on paper.”

Qualifying that it was just the baseline and that, properly trained, Microsoft employees claim their speech recognition can achieve 99% accuracy, Hachman nonetheless concluded that training speech within Windows is a lengthy process. It actually took 10 minutes, which Hachman said felt “like a lifetime.”

He went on to say that training speech is faster (“perhaps a minute or so”) for Nuance’s speech recognition software Dragon, which announced its own breakthrough back in August. According to Vlad Sejnoha, Chief Technology Officer at Nuance, Dragon’s deep neural nets can now “continuously learn from the user’s speech…and drive accuracy rates in some instances up to 24% higher.”

Dragon and similar voice recognition software are to transcriptionists what translation productivity tools (or CAT tools) are to translators: They do not replace the transcriptionist, but make them more productive. To extend the analogy, the technology that operates on the original audio file to produce textual output would be the equivalent of machine translation.

What is the impact of productivity tools like Dragon on human transcriptionists and the overall transcription market? [Transcription is part of the broader document preparation services market, which is supposedly worth USD 5bn in the US alone. In fact, Slator could not pin down a credible figure.] And is fully automated speech recognition technology about to replace human transcriptionists in the real world? Slator spoke to four professional transcriptionists for their take on the matter.

How to Train Your Dragon

One source we spoke to said there are still way too many variables for technology to be able to completely replace the entire human transcription process.

According to Belle Lapa, who founded Scriberspoint in 2012 after having worked for such companies as Lingo24 and S&P, “The way a speaker speaks, overlapping speakers, background noise and, perhaps, most importantly, the very variable nature of language itself” have yet to be factored in.

In terms of voice input to increase productivity (i.e., listening to the audio and repeating what you hear into a mic instead of typing), transcriptionists today still need to train their speech recognition tools for them to be useful, in much the same way translators must train translation productivity tools. Think adaptive machine translation and the need to personalize the system.

According to another professional transcriptionist to whom we spoke, using Dragon alone, untrained, yielded poor results. “It was simply unusable,” said the source, adding, “A human transcriptionist can achieve 80–90% accuracy with Dragon and, with the help of a real-time editor, can get it up to 93–96%.”

The source said that, to make a transcriptionist work faster, a tool has to reach 65–75% accuracy (“using our own in-house accuracy checking tool”). Anything below would be next to useless. “It just doesn’t make sense to keep deleting things if it’s that bad. It just slows up the typist,” our source pointed out.

The same source added that accuracy also depends on accents, naming as typical transcriptionist Waterloos audio from speakers with Indian, Portuguese, and French accents. “Even when using Nuance’s Dragon, we still need humans to train the tool for context, nuance, homophones, proper nouns, accents, etc.,” our source said.

Here Too, Still a Long Way

Our source agreed to speak to us on condition of strict anonymity as professional transcriptionists are bound by stringent non-disclosure agreements. The source did disclose, however, that transcriptionists generally do not see their job as a “real career.”

Explained our source: “It’s probably what I’d call a high-paying, labor-intensive job. We are usually there to keep our bills paid — while we work to go somewhere else. Most of the best transcriptionists in our department have moved on to other things not transcription-related. Those who stayed go on to management. There’s just no real career growth there. What else can you do, really? Type faster? Speed-read?”

Asked if they feel, at all, threatened at being replaced by technology, our source said, “It’s not like we’d picket if the company suddenly announces a tool has replaced us. We’d hate losing a high-paying job, for sure. But, like I said, we don’t view transcription as a real career.”

Besides, said our source, technology still has a long way to go before achieving human-quality transcription — an on-the-ground assessment, despite The Economist saying that, “Thanks to deep learning, machines now nearly equal humans in transcription accuracy.”

Our source, a transcriptionist at a Fortune 500 company, said, “Anything purely done by tech is unusable. We did try, but the results were so bad. The best tools we had still needed human supervision and they are still being developed with the idea of having humans eventually edit the end result.”

For her part, Scriberspoint’s Belle Lapa said certain parts of a transcriptionist’s job have indeed been made easier by technology. “Foot pedals have replaced keyboard shortcuts and voice-recognition tools and software are able to capture speech and turn them into text, with varying degrees of accuracy,” according to Belle.

Belle is optimistic that speech-to-text technology can only get better with time. She said they now use “a very good dictionary,” glossary, and database tools, which help them in subjects like legal and medical where terms are highly specialized.

The Scriberspoint founder, who is based in the Philippines, benefitted greatly from what she described as “the last big boom in transcription” when the US required subtitles for the hearing impaired. Since then “the number of clients and their transcription needs have remained stable or moving upward, never downward.”

She regards India as their greatest competitor, “if only because of the cheaper rates they charge,” but said she is not worried, for now, about running out of clients.

Not so optimistic is “Carla” [last name withheld on request], who is also based in the Philippines and is now a content editor at a financial intelligence firm. Looking at the transcription industry from a distance, having left it “for a while now,” she would not call it “a hot market.”

She said, “Transcribing through typing is increasingly being replaced by voice transcription or captioning,” and expects technology to replace human transcription first in captioning, subtitling, news, and medical.

Carla added, however, that if it involves voice transcription with near real-time editing, then there will be reasonable opportunities in financial, legal, and medical.

As for competitor India, she said the country may offer cheaper labor and tech, but “the Philippines would be in the higher tier if you were to factor in English listening skills, acquired typing skills, and adaptability.”

Prices Dropping

Much less optimistic is Noriel Ramientas II, who recalled that when he started doing home-based transcription, part-time, in February 2012, “everything was good — the pay was good, projects kept coming in, it was all just peachy. Since March 2015 though that hasn’t been the case.”

Fewer projects came in, he said, and the pay had reached a ceiling of USD 35 per audio hour, compared to five years ago when one could charge as high as USD 50 for the same audio length.

“I don’t mean to burst anyone’s bubble but I don’t think the transcription market — at least as far as home-based transcriptionists are concerned — is growing. That’s why I haven’t been doing it for more than a year now,” said Noriel.

He described other online jobs (blogging, virtual assistance, bookkeeping) as more readily available and higher paying, and the demand for transcriptionists as diminishing, “as more people, most notably those from India, flock to online job marketplaces like Upwork and Guru.”

As for tech replacing humans, Noriel’s view was pretty simple. He said it is just a tool. “Tools never function on their own. Someone sentient must put a tool to use before it is deemed useful.”

The experts seem to agree, especially as far as speech recognition technology and other transcription tools go. Although Dragon founder James Baker once said large vocabulary speech recognition was a solvable problem within his lifetime, more recently, Gerald Friedland said, “We used to joke that, depending who [sic] you ask, speech recognition is either solved or impossible. The truth is somewhere in between.”

Friedland, Audio and Multimedia Research Director at the UC Berkeley-affiliated International Computer Science Institute, was quoted in an April 2016 Wired article titled “Why Our Crazy-Smart AI Still Sucks at Transcribing Speech.”

Similar technological forces drive transcription and translation. And the lessons that transcription holds for translation are not new: Go niche, specialize, and embrace technology to increase productivity.