Here’s How Appen Collects the Language Data So Central to AI

September 27, 2019

Features

by Esther Bond

Big tech’s ever-growing appetite for language data has been central to the recent success of companies providing training data for artificial intelligence (AI).

Market leader Appen — a global firm providing high-quality training datasets for AI through a combination of software and services — has experienced exceptional growth since going public in 2015. Meanwhile, new entrants to the space, such as Scale AI, have attracted much interest from investors.

Appen CEO Mark Brayan joined the lineup at SlatorCon San Francisco in September 2019 to speak about his company’s approach to language data collection and curation.

According to Brayan, the better the underlying data, the better the AI performs; without good-quality data, you run the risk of “garbage in, garbage out.”

A meaningful part of Appen’s business, Brayan said, is related to collecting speech data for training speech recognition software and other speech applications using a global crowdsourcing model. Depending on the specific use case, Appen must seek out speech data that matches not only the language requested but also the accent, age, and pitch of the speaker.

“The data must fit the use case,” Brayan explained. For example, if the speech recognizer is in-car, the speech data should be collected on the road so the data matches the acoustic environment of the real-world use case. Likewise, if the use case is children playing video games, this exact scenario must be replicated during language collection to achieve the best results.
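As a rough illustration of “the data must fit the use case,” a collection brief can be thought of as a filter over recording metadata. The sketch below is hypothetical: the `Recording` and `UseCaseSpec` types, their field names, and the example values are assumptions made for illustration, not Appen’s actual schema or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recording:
    """Metadata for one collected utterance (hypothetical fields)."""
    language: str
    accent: str
    speaker_age: int
    environment: str  # e.g. "in_car", "studio"


@dataclass(frozen=True)
class UseCaseSpec:
    """What the customer's application needs (hypothetical fields)."""
    language: str
    accents: frozenset
    min_age: int
    max_age: int
    environment: str

    def matches(self, rec: Recording) -> bool:
        # A recording fits only if every requested attribute lines up.
        return (
            rec.language == self.language
            and rec.accent in self.accents
            and self.min_age <= rec.speaker_age <= self.max_age
            and rec.environment == self.environment
        )


def select_recordings(recordings, spec):
    """Keep only the recordings that fit the use case."""
    return [r for r in recordings if spec.matches(r)]


in_car_spec = UseCaseSpec("en", frozenset({"en-US", "en-AU"}), 18, 80, "in_car")
pool = [
    Recording("en", "en-US", 34, "in_car"),
    Recording("en", "en-US", 34, "studio"),  # wrong acoustic environment
    Recording("en", "en-GB", 52, "in_car"),  # accent not requested
    Recording("en", "en-AU", 9, "in_car"),   # outside the age range
]
selected = select_recordings(pool, in_car_spec)
print(len(selected))  # 1
```

Only the first recording survives, because each of the others misses on one attribute of the brief.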

The speech data must also be broad-reaching. “It’s not just the actual words,” Brayan said, “it’s also the variety of the words that comes into play because we don’t all say the same thing.” For example, when asking a speech recognizer the price of something, speakers might ask “how much does it cost?” or “what’s the cost?” The data collected needs to encompass all possible variations.
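One simple way to picture the variety requirement is to count how many distinct phrasings have been collected for each thing a speaker might want. This is a minimal sketch under stated assumptions: the intent labels (`ask_price`, `ask_hours`) and the normalisation rule are hypothetical and do not come from the talk.

```python
from collections import defaultdict


def phrasing_coverage(utterances):
    """Count distinct phrasings collected for each intent.

    `utterances` is an iterable of (intent, text) pairs; the intent
    labels are hypothetical and used only for illustration.
    """
    variants = defaultdict(set)
    for intent, text in utterances:
        # Normalise case and surrounding whitespace so trivial differences
        # are not counted as separate phrasings.
        variants[intent].add(text.strip().lower())
    return {intent: len(texts) for intent, texts in variants.items()}


collected = [
    ("ask_price", "How much does it cost?"),
    ("ask_price", "What's the cost?"),
    ("ask_price", "how much does it cost?"),
    ("ask_hours", "When do you open?"),
]
coverage = phrasing_coverage(collected)
print(coverage)  # {'ask_price': 2, 'ask_hours': 1}
```

A low count for an intent would suggest that more varied prompts or speakers are needed for that part of the collection.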

After data collection comes the transcription step, which Appen also handles using a crowdsourced model, Brayan said. Following transcription, the final step is for linguists to annotate the data to indicate how it is pronounced, compiling something called a “pronunciation dictionary.” The data is then fed into “whatever application you’re using: a virtual assistant, or a search engine, or online ecommerce,” he added.
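To make the last step concrete, a pronunciation dictionary can be pictured as a mapping from each word to the pronunciations linguists have annotated for it. The following is a minimal sketch, not Appen’s format: the function name, the input shape, and the ARPAbet-style phoneme strings are all assumptions chosen for illustration.

```python
from collections import defaultdict


def build_pronunciation_dictionary(annotated_words):
    """Map each word to the pronunciations annotated for it.

    `annotated_words` is an iterable of (word, phonemes) pairs, where
    `phonemes` is a space-separated string. The ARPAbet-style symbols
    used below are illustrative only.
    """
    lexicon = defaultdict(set)
    for word, phonemes in annotated_words:
        # Lower-case the word so "Cost" and "cost" share one entry.
        lexicon[word.lower()].add(phonemes)
    # Sort the variants so the output is deterministic.
    return {word: sorted(prons) for word, prons in lexicon.items()}


annotations = [
    ("cost", "K AO S T"),
    ("cost", "K AA S T"),  # a second, accent-dependent variant
    ("Cost", "K AO S T"),  # duplicate once case is normalised
    ("much", "M AH CH"),
]
lexicon = build_pronunciation_dictionary(annotations)
print(lexicon["cost"])  # ['K AA S T', 'K AO S T']
```

Keeping every observed variant per word, rather than a single canonical form, reflects the article’s point that speakers differ by accent and do not all say the same thing.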

Technology is core to the entire process. Asked how the technology-heavy Figure Eight, which Appen acquired in March 2019, factors into operations, Brayan said most of the tools the crowd uses to transcribe, annotate, and perform relevance judgements come from Figure Eight. They also have a “self-service interface” that serves as an ordering portal for customers.

Scale and Complexity

While it is generally a case of “the more data the better,” Brayan also said that it is important for customers to know exactly what they are looking for. “The amount of data you collect is typically dictated by the budget you have, so you want to be really careful that you’re collecting what you need,” Brayan cautioned.

Price is not the only consideration: “Data is not only expensive to collect; it is also complicated to collect and it’s complicated to work with,” Brayan said. Factors such as the variety of speech data, technical considerations, and the huge recruitment effort required, all contribute to this complexity, he added.

Appen’s recruitment operations are indeed vast. The company has a team of recruiters working out of the Philippines 24 hours a day. “They process 100,000 job applications for crowd work every month,” and Appen places a couple of hundred online ads each day in multiple locations, Brayan said. Appen has over a million crowd workers, and the company pays 40,000–50,000 people to work for it every month, he said.

Data Privacy

In the wake of the recent controversy around a number of companies accessing speech data without users’ permission, an audience member asked Brayan whether Appen has been affected by issues relating to data privacy.

According to Brayan, Appen has been largely unaffected since they “deal with very little of what we call live data; e.g., stuff that comes directly out of people’s phones that goes back to the company that owns the phone.” He elaborated, “Most of the data we deal with is engineered, it’s collected, and we get people’s permission to use their data.”

Mark Brayan (Appen) and Esther Bond (Slator)

Although in Brayan’s view, “everybody is trying to reduce their reliance on data because data is expensive” (and also complex to work with), AI’s appetite for language data shows no sign of dampening. Despite the challenges, Brayan believes that “it’s pretty irrefutable that the techniques that work, including deep learning, rely on a lot of data. They don’t seem to be topping out with the amount of data.”

This is good news for the AI-support services market, both for companies dedicated to this niche and for language service providers, such as Lionbridge and Flitto, that provide language data operations among their service offerings.

By Esther Bond

Research Director at Slator. Localization enthusiast, linguist and inquisitor. London native.
