For Dataset Creation, Humans Should Evaluate Machine Text — New Research

Natural Language Data Collection Research

New research on natural language processing (NLP) offers a paradigm for human-AI collaboration, which could eventually influence the data collection and evaluation that language service providers (LSPs) perform for clients.

LSPs are increasingly moving into the data-for-AI space, competing with longtime AI data leader Appen as well as a number of startups. To name just one example, Polish LSP Summa Linguae announced its EUR 5m (USD 5.7m) acquisition of Belgium-based language data provider Datamundi in December 2021.

On Twitter, lead author Alisa Liu shared a January 2022 paper that explores a new pipeline for dataset creation. WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation is the result of a collaboration by Liu of the University of Washington and Swabha Swayamdipta of the Allen Institute for Artificial Intelligence (AI2). The two other contributors, Noah A. Smith and Yejin Choi, are affiliated with both organizations.

Until now, crowdsourcing dataset creation might have seemed like the best option for quickly generating a massive number of free-text examples, but the authors pointed out that it is far from perfect.

“While human annotators are generally reliable for writing correct examples, crafting diverse and creative examples at scale can be challenging,” the researchers wrote. “Thus, crowdworkers often resort to a limited set of writing strategies for speed, at the expense of diversity.”

In particular, they note, models for NLI (i.e., predicting whether a premise statement entails, contradicts, or is neutral to a hypothesis statement) struggle to perform well on out-of-domain instances, “suggesting they have overfit to existing datasets.”
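To make the task concrete: an NLI model takes a premise and a hypothesis and predicts one of three labels. The snippet below is a purely illustrative stand-in (a naive heuristic, not a real model) showing that interface; the function name and heuristic are invented for this sketch.

```python
# Hypothetical illustration of the NLI task format: a model maps a
# (premise, hypothesis) pair to one of three labels.
LABELS = ("entailment", "contradiction", "neutral")

def toy_nli(premise: str, hypothesis: str) -> str:
    """A stand-in classifier (not a real model) showing the NLI interface."""
    p, h = premise.lower().split(), hypothesis.lower().split()
    # Naive heuristic purely for illustration: negation suggests contradiction,
    # full word overlap suggests entailment, anything else is neutral.
    if "not" in h and "not" not in p:
        return "contradiction"
    if set(h) <= set(p):
        return "entailment"
    return "neutral"

print(toy_nli("A dog runs in the park", "A dog runs"))          # entailment
print(toy_nli("A dog runs in the park", "A dog does not run"))  # contradiction
```

Real NLI models learn this mapping from data, which is exactly why overfitting to the quirks of a single training dataset becomes a problem on out-of-domain inputs.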

At the same time, massive language models, such as GPT-3, continue to make headlines for their remarkably human-like, open-ended text generation.

The researchers believe their new system “brings together the generative strength of language models and the evaluative strength of humans” — perhaps in a way reminiscent of recent trends observed in MT, in which an MT model tackles the source text in a first round of translation and human linguists then refine the output via post-editing.

Best of Both Worlds

The four-step pipeline starts with the existing large-scale, multi-genre dataset MultiNLI. Using a method known as dataset cartography, the team automatically identified pockets of examples in the dataset that showed “challenging reasoning patterns relative to a trained model.”
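The core idea behind dataset cartography can be sketched in a few lines: for each training example, track the model's probability on the gold label across epochs, then characterize the example by its mean confidence and variability. The per-epoch probabilities below are invented for illustration; this is a minimal sketch of the statistics, not the authors' implementation.

```python
import statistics

def cartography_stats(epoch_probs):
    """epoch_probs: gold-label probabilities for one example, one per epoch.

    Returns (confidence, variability): the mean and standard deviation of
    the model's gold-label probability across training.
    """
    confidence = statistics.mean(epoch_probs)
    variability = statistics.pstdev(epoch_probs)
    return confidence, variability

# Invented trajectories: an "easy" example is learned early and stays learned;
# an "ambiguous" one fluctuates across epochs.
easy = cartography_stats([0.95, 0.97, 0.98, 0.99])
ambiguous = cartography_stats([0.30, 0.70, 0.45, 0.80])

print(easy)       # high confidence, low variability
print(ambiguous)  # lower confidence, high variability
```

In the WANLI pipeline, the ambiguous pockets of MultiNLI, where predictions fluctuate during training, are what seed the generation of new examples.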


Researchers then used a pretrained language model to generate new examples likely to have the same pattern. While the prompt consisted of examples that shared a label, the language model did not necessarily “see” the label.
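The generation step amounts to few-shot prompting: seed examples that share a reasoning pattern are laid out as in-context demonstrations, and the model completes the next slot. The template below is an assumption for illustration, not the paper's actual prompt wording.

```python
# Hypothetical sketch of a few-shot prompt for eliciting new NLI examples.
# The formatting ("Example N.", "Premise:", "Hypothesis:") is assumed, not
# taken from the paper's real template.
def build_prompt(seed_examples):
    lines = []
    for i, (premise, hypothesis) in enumerate(seed_examples, 1):
        lines.append(f"Example {i}.")
        lines.append(f"Premise: {premise}")
        lines.append(f"Hypothesis: {hypothesis}")
        lines.append("")
    # Leave the final slot open for the language model to complete.
    lines.append(f"Example {len(seed_examples) + 1}.")
    lines.append("Premise:")
    return "\n".join(lines)

prompt = build_prompt([
    ("The cafe was empty by nine.", "The cafe closed at nine."),
    ("Every seat on the train was taken.", "The train was full."),
])
print(prompt)
```

A model such as GPT-3 would then generate the new premise/hypothesis pair as the completion; note that, as the article says, the label itself need not appear anywhere in the prompt.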

The data maps inspired a new metric, which the team employed to automatically filter generated examples for the most ambiguous samples, since those are most likely to aid model learning.
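Mechanically, the filtering step is a ranking problem: score each generated candidate by an estimate of how ambiguous it would be for the trained model, and keep the top candidates. The scoring values and function below are stand-ins; the paper derives its actual metric from data-map statistics.

```python
# Sketch of the filtering step with an assumed ambiguity score per example.
def filter_most_ambiguous(scored_examples, keep=2):
    """scored_examples: list of (example, ambiguity_score) pairs.

    Keeps the `keep` highest-scoring candidates, since ambiguous examples
    are the ones most likely to aid model learning.
    """
    ranked = sorted(scored_examples, key=lambda pair: pair[1], reverse=True)
    return [example for example, _ in ranked[:keep]]

candidates = [("ex_a", 0.12), ("ex_b", 0.41), ("ex_c", 0.05), ("ex_d", 0.33)]
print(filter_most_ambiguous(candidates))  # ['ex_b', 'ex_d']
```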

The final step was human review. Sixty-two Amazon Mechanical Turk crowdworkers evaluated and, when necessary, revised 118,724 generated examples for quality. Two crowdworkers analyzed each example.

Researchers discarded any example if either annotator chose to discard it, and kept a revision only if both annotators believed the example needed revision. In general, annotator revisions improved either fluency (targeting well-documented issues with text generation, such as redundancy and self-contradiction) or clarity (often resolving ambiguities in the example that made the entailment relationship difficult to determine).
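The adjudication rules described above are simple enough to state as code. This is a sketch with assumed verdict names ("approve", "revise", "discard"); the decision logic mirrors the article: either annotator can veto an example, and a revision is kept only when both annotators call for one.

```python
# Sketch of the two-annotator adjudication rules, with assumed verdict labels.
def adjudicate(verdict_a: str, verdict_b: str) -> str:
    """Each verdict is 'approve', 'revise', or 'discard'."""
    if "discard" in (verdict_a, verdict_b):
        return "discard"        # either annotator can veto the example
    if verdict_a == "revise" and verdict_b == "revise":
        return "keep_revised"   # a revision is kept only if both agree
    return "keep_original"

print(adjudicate("approve", "discard"))  # discard
print(adjudicate("revise", "revise"))    # keep_revised
print(adjudicate("revise", "approve"))   # keep_original
```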

The process resulted in a final dataset of 108,357 labeled examples. (The name of the dataset, WaNLI, is a homophone for a Chinese expression meaning “ten thousand reasoning,” a nod to its size.)

In a case of the student becoming the teacher, the team found that a model trained on WaNLI alone consistently outperformed models trained on MultiNLI, or on WaNLI combined with MultiNLI, across seven out-of-domain test sets.

Since the MultiNLI dataset is four times larger than WaNLI, the researchers concluded that more data is not necessarily better, especially when composed predominantly of easy-to-learn examples.

WaNLI also outperformed another dataset, Adversarial NLI, on all but two test sets. “This result is substantial because the creation pipeline of Adversarial NLI, which required annotators to craft examples that fool existing models, posed a much greater challenge for human workers and used more existing resources to train adversaries,” the authors wrote. 

By contrast, for WaNLI, language models do not necessarily need to “understand” the task in order to successfully create new examples; they simply have to replicate linguistic patterns.

“Our work suggests that a better way of eliciting human intelligence at scale is by asking workers to revise and evaluate content,” the researchers explained. “To this end, we hope to encourage more work in developing methods of leveraging advances in large pretrained language models to aid the dataset creation process.”