Big tech’s ever-growing appetite for language data has been central to the recent success of companies providing training data for artificial intelligence (AI).
Market leader Appen — a global firm providing high-quality training datasets for AI through a combination of software and services — has experienced exceptional growth since going public in 2015. Meanwhile, new entrants to the space, such as Scale AI, have attracted much interest from investors.
Appen CEO Mark Brayan joined the lineup at SlatorCon San Francisco in September 2019 to speak about his company’s approach to language data collection and curation.
According to Brayan, the better the underlying data, the better the performance of the AI; without good-quality data, you run the risk of “garbage in, garbage out.”
A meaningful part of Appen’s business, Brayan said, is related to collecting speech data for training speech recognition software and other speech applications using a global crowdsourcing model. Depending on the specific use case, Appen must seek out speech data that matches not only the language requested but also the accent, age, and pitch of the speaker.
“The data must fit the use case,” Brayan explained. For example, if the speech recognizer is in-car, the speech data should be collected on the road so the data matches the acoustic environment of the real-world use case. Likewise, if the use case is children playing video games, this exact scenario must be replicated during language collection to achieve the best results.
The speech data must also be broad-reaching. “It’s not just the actual words,” Brayan said, “it’s also the variety of the words that comes into play because we don’t all say the same thing.” For example, when asking a speech recognizer the price of something, speakers might ask “how much does it cost?” or “what’s the cost?” The data collected needs to encompass all possible variations.
After data collection comes the transcription step, which Appen also handles using a crowdsourced model, Brayan said. Following transcription, the final step is for linguists to annotate the data to indicate how it is pronounced, compiling something called a “pronunciation dictionary.” The data is then fed into “whatever application you’re using: a virtual assistant, or a search engine, or online ecommerce,” he added.
Technology is core to the entire process, of course. Asked how technology-heavy Figure Eight, which Appen acquired in March 2019, factors into operations, Brayan said most of the tools the crowd uses to transcribe, annotate, and perform relevance judgements come from Figure Eight. They also have a “self-service interface” that serves as an ordering portal for customers.
Scale and Complexity
While it is generally a case of “the more data the better,” Brayan also said that it is important for customers to know exactly what they are looking for. “The amount of data you collect is typically dictated by the budget you have, so you want to be really careful that you’re collecting what you need,” Brayan cautioned.
Price is not the only consideration: “Data is not only expensive to collect; it is also complicated to collect and it’s complicated to work with,” Brayan said. Factors such as the variety of speech data, technical considerations, and the huge recruitment effort required all contribute to this complexity, he added.
Appen’s recruitment operations are indeed vast. The company has a team of recruiters working out of the Philippines 24 hours a day. “They process 100,000 job applications for crowd work every month,” and Appen places a couple of hundred online ads each day in multiple locations, Brayan said. Appen has over a million crowd workers, and the company pays 40,000–50,000 people to work for it every month, he said.
Data Privacy
In the wake of the recent controversy around a number of companies accessing speech data without users’ permission, an audience member asked Brayan whether Appen has been affected by issues relating to data privacy.
According to Brayan, Appen has been largely unaffected since they “deal with very little of what we call live data; e.g., stuff that comes directly out of people’s phones that goes back to the company that owns the phone.” He elaborated, “Most of the data we deal with is engineered, it’s collected, and we get people’s permission to use their data.”

Although, in Brayan’s view, “everybody is trying to reduce their reliance on data because data is expensive” (and also complex to work with), AI’s appetite for language data shows no sign of dampening. Despite the challenges, Brayan believes that “it’s pretty irrefutable that the techniques that work, including deep learning, rely on a lot of data. They don’t seem to be topping out with the amount of data.”
This is good news for the AI-support services market, both for companies dedicated to this niche and for language service providers, such as Lionbridge and Flitto, that provide language data operations among their service offerings.