Still a Hassle for Translation Project Managers, But OCR Is Improving

Optical Character Recognition for Translation Agencies and Language Service Providers

Language industry project managers are all too familiar with this scenario: A client wants to translate a document in an uneditable file format. But, before anything, the PM must put the document through a round of optical character recognition (OCR) just to determine the word count. The task can be further complicated if the document is handwritten or contains text in an unknown language (or both — for a real headache).

Many companies have found ways around the problem of OCR. For small businesses, Adobe Acrobat might get the job done; but as a company grows, it might explore other options, such as OpenText’s series of Capture engines. ABBYY FineReader Engine also offers a suite of recognition products, including OCR technology advertised as working for up to 200 languages.

Google, for its part, has since 2006 sponsored further development of the open source OCR engine Tesseract, which was originally developed by Hewlett-Packard in the 1980s. The Google Cloud Platform also provides a tutorial on performing OCR using a collection of billable Cloud products. Amazon, meanwhile, prides itself on Textract’s ability to extract data from tables and charts while maintaining original formatting.

Aspirational Disruptors

Each newcomer to the OCR scene touts its algorithms and technology as the definitive answer to the OCR challenge. Language service provider Tarjama, based in Dubai, UAE, has built proprietary OCR tech based on neural networks.

Singaporean startup Staple specializes in documents where layout is important, such as invoices, tax forms, and bank statements; users can input documents in 100 languages via WeChat, Google Drive, and Dropbox.

Sid Newby, creator and CTO of Cullable (and owner of the domain), embraces OCR’s bad reputation. He founded Cullable in 2015 based on years of experience in business litigation with eDiscovery (i.e., sifting through thousands of pages of documents for any possibly relevant information). Attorneys can miss a needle of critical evidence in a haystack of unsearchable text, which could be disastrous for their case.

Newby believes that the AI behind Cullable’s system makes it superior to competitors’ offerings. “Every page we process, essentially, we get a little bit better,” Newby told Slator. In terms of completing and recognizing partial words in text, he said, “We’re trying to understand thoughts. Then AI improves upon that knowledge base with new datasets that come in.”

Cullable has been available to consumers since 2019, and its customers are predominantly US-based, with a few in the UK and South Africa. “Several translation companies have come to us with projects in the past,” Newby said. “They send us what they have problems with: poor image quality, skewed images, partially redacted words, handwriting.”

In addition to Cullable’s core OCR service, machine translation (MT) is integrated into the application. “Really good OCR machine translation sings and dances,” Newby said. “We use the Google Translate API because it’s native to our stack in Google.” Of course, a language service provider with its own proprietary MT engine would use that instead.

Improved OCR on the Horizon?

Looking ahead, OCR still stands to benefit from research. A September 2020 paper details how two researchers in Argentina created a dataset of annotated images from Japanese manga. The goal: enable OCR in manga at the pixel level.

Existing annotated, pixel-level datasets, the authors wrote, typically consist of real-world images, which lack speech balloons. The text in those images is usually in English and is rarely hand-drawn in artistic styles, as it is in manga. Although this specific dataset was designed around manga, the principles behind it could be applied to OCR of Japanese texts in other domains.

A recent literature review, published in July 2020, laid out the limitations of OCR research thus far. First, most research deals with the most widely spoken languages on the planet, partly because datasets are often unavailable for languages with fewer speakers. It can also be difficult for systems to recognize characters handwritten by many different people, each with their own distinct handwriting.

Interest continues to grow in OCR of “text in the wild” — that is, on-screen characters and text in different settings — which might eventually be relevant to translators dealing with text in streaming media. But that may depend on the potential earnings at stake.

The review’s authors concluded that the commercialization of research needs to improve to help build “low-cost, real-life systems for OCR that can turn lots of invaluable information into searchable/digital data.”