A Warning About Data Annotation as a Business From a Man Who Sells AI to AI Firms

Acrolinx CEO and Founder Andrew Bredenkamp at SlatorCon Amsterdam 2019

“When it comes to delivering value, none of the algorithms will deliver value without data. Data trumps algorithms every time,” Acrolinx Founder Andrew Bredenkamp told the SlatorCon Amsterdam 2019 audience. Acrolinx is an AI-powered content management platform that uses natural language processing (NLP) technology on a large scale for clients across a range of industries.

Citing the now widely accepted statement that “Data is the new oil,” Bredenkamp explained that a massive data annotation industry has grown up alongside data. He said the data labeling market, valued at USD 0.5bn in 2018, will more than double in the next four years. An August 2019 article in The New York Times estimated that data annotation accounts for 80% of the time spent building AI technology.

Little wonder, then, that some language service providers (LSPs) are eager to grab a piece of the data annotation market, which companies like Appen have historically dominated and where well-funded startups like Scale are competing.

Bredenkamp cautioned that, in his view, companies providing such middleman services are starting to get squeezed: Individual annotators have become more organized and have begun advertising their services in the freelance data market, and data requirements are actually falling.

“What the AI community is working really hard on is removing this (data annotation) bottleneck to building applications,” Bredenkamp said. “Obviously, if you build algorithms, the last thing you want to do is spend 80% of your time doing something else.”

Several solutions have already emerged as potential alternatives to data annotation.

In synthetic labeling, two AI systems interact with and test each other to artificially create and label content, a process that Bredenkamp said is starting to look similar to what humans do.

Facebook and other groups have used unsupervised approaches to building AI applications. One method used to build machine translation (MT) systems is to merge huge, independent language datasets, with no parallel language data, and let the system observe how word vectors start to cluster. The system then “learns” how words behave in context.

Content robots have also been able to generate text in “simple” areas. Bredenkamp pointed out, however, that sentences may sound plausible when read individually, but the text generally lacks coherence as a whole.

This shortcoming echoes another challenge in AI: the significant gap between merely seeing a correlation and tapping actual, real-world knowledge.

According to Bredenkamp, AI has struggled with disambiguation (i.e., interpreting linguistic ambiguity) for the past 50 years. Still, there have been recent gains: Vector composition has helped AI learn word relationships that were impossible to capture three or four years ago, and big tech companies have sponsored initiatives supporting research on building knowledge for AI systems.

“The open question is: Will AI be able to learn knowledge like humans learn knowledge, by observation of data?” Bredenkamp asked the audience.

LSPs in an AI World

Commenting more specifically on the language services industry, Bredenkamp formulated a view on how LSPs may want to position themselves.

To gain a significant competitive advantage, Bredenkamp said, LSPs have to adopt a hyperlocal strategy and look beyond the small set of languages many of them have traditionally worked with.

An August 2019 KPMG report, for example, found that non-English use of the Internet is growing six times as fast as English use, and a November 2019 article in The Economist described how China is waking up to the commercial value of dialects.

Bredenkamp said, “Increasingly, brands are translating the user experience into those languages (e.g., dialects) to drive better market penetration in the regions, and these are not small regions. They’re targeting hundreds of millions of users.”

The Acrolinx Founder said he also expects human translation to become rarer for well-resourced language pairs due to high-quality systems like DeepL and Google Translate. He said he has seen big organizations completely move away from human translation for certain types of content.

The hope then lies in transcreation, which is still far from being automated.

Bredenkamp, a self-described optimist, rejects the fear-mongering that often comes with predictions about AI, and even welcomes the possibility of AI surpassing the abilities of the world’s smartest people.

“Why should it stop there? We aren’t the limit,” Bredenkamp said. “I think this will be good for us. Let’s all get used to it. It will be a brave new world.”