In a paper published on October 23, 2023, a group of researchers from the University of Amsterdam and the Huawei Amsterdam Research Center demonstrated the practical application of large language models (LLMs) in data cleaning.
The researchers emphasized that LLMs are highly capable of removing noise from datasets, focusing on the cleaning of the MTNT (Machine Translation of Noisy Text) dataset as a case in point. The resulting dataset, called C-MTNT, exhibits significantly less noise in the target sentences while preserving the semantic integrity of the original sentences.
“To the best of our knowledge, this is the first study to apply LLMs in the context of cleaning data,” they said.
Their primary goal in this research was to leverage LLMs to effectively eliminate noise and, in doing so, generate cleaner parallel language datasets. As they explained, these refined datasets serve as a valuable resource for evaluating the robustness of neural machine translation (NMT) models when dealing with noisy input.
The MTNT dataset is a well-established resource for exactly this purpose. “MTNT stands as one of the few well-established resources for evaluating NMT models’ performance in the presence of noise,” said the researchers.
However, despite its value, MTNT itself contains noise in its target sentences, which hinders its effectiveness as an evaluation tool for NMT models. The primary aim of data cleaning in this context was therefore to make MTNT more suitable for evaluating NMT models.
Traditionally, data cleaning approaches involved filtering out undesirable sentences while retaining high-quality ones, often relying on predefined rules. However, these methods had limitations, as they couldn’t address every possible source of noise and struggled to identify natural noise introduced by human input.
In response to these limitations, the researchers proposed using LLMs for data cleaning, choosing GPT-3.5 as their model. They specifically employed the original GPT-3.5 variant, noting that publicly available pre-trained LLMs such as Llama 2 might also have this ability.
The researchers explained that employing LLMs for data cleaning poses several challenges that require meticulous attention. LLMs must effectively cleanse target sentences by removing semantically meaningless emojis, converting emojis that carry semantic content into words, and correcting misspellings.
Moreover, the cleaned target sentence must preserve the original semantic content, ensuring it conveys the intended meaning of the noisy source sentence, thus maintaining the accuracy and fidelity of the translation.
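As a rough illustration of these noise categories, the before/after pairs below sketch what each operation might look like. The sentence pairs are invented for this article, not examples taken from MTNT or the paper:

```python
# Hypothetical before/after pairs illustrating the three noise operations
# described above (the examples are invented, not drawn from MTNT).
noise_examples = [
    # 1. Remove semantically meaningless emojis
    ("Great movie 😂😂😂", "Great movie"),
    # 2. Convert emojis that carry semantic content into words
    ("I ❤️ this song", "I love this song"),
    # 3. Correct misspellings
    ("Definately worth watching", "Definitely worth watching"),
]

for noisy, clean in noise_examples:
    print(f"{noisy!r} -> {clean!r}")
```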
They designed a set of few-shot prompts to guide the LLM in data cleaning in three different scenarios, taking into account the availability of language resources:
- Bilingual cleaning — using both noisy source and target samples as input, with a focus on cleaning the target sample while keeping it aligned with the source sample.
- Monolingual cleaning — using a noisy target sample as input and generating a clean target sample as the output.
- Translation — taking a noisy source sample as input and producing a clean target sample as output.
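The paper's actual prompts are not reproduced here, but the sketch below shows how few-shot prompts for the three scenarios might be structured. The instruction wording and the in-context example are assumptions for illustration, not the authors' templates:

```python
# Sketch of few-shot prompt construction for the three cleaning scenarios.
# All instruction text and examples are hypothetical, not the paper's.

FEW_SHOT_EXAMPLE = (
    "Noisy target: i loove this film 😂😂\n"
    "Clean target: I love this film.\n"
)

def bilingual_prompt(noisy_source: str, noisy_target: str) -> str:
    """Clean the target while keeping it aligned with the source."""
    return (
        "Clean the target sentence: remove meaningless emojis, spell out "
        "meaningful ones, and fix misspellings. Keep the result aligned "
        "with the source sentence.\n\n"
        + FEW_SHOT_EXAMPLE
        + f"\nSource: {noisy_source}\nNoisy target: {noisy_target}\nClean target:"
    )

def monolingual_prompt(noisy_target: str) -> str:
    """Clean a noisy target sample using the target language alone."""
    return (
        "Clean the following sentence: remove meaningless emojis, spell "
        "out meaningful ones, and fix misspellings.\n\n"
        + FEW_SHOT_EXAMPLE
        + f"\nNoisy target: {noisy_target}\nClean target:"
    )

def translation_prompt(noisy_source: str) -> str:
    """Produce a clean target-language sentence from a noisy source."""
    return (
        "Translate the noisy source sentence into a clean target "
        "sentence, removing emojis and fixing misspellings.\n\n"
        + f"Source: {noisy_source}\nClean target:"
    )
```

Each function returns a string that would be sent to the model, which then completes the text after “Clean target:”.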
By measuring the noise frequency in the cleaned target sentences, measuring the semantic similarity between noisy and cleaned target sentences, and evaluating with human annotators and GPT-4, they showed that the proposed methods can effectively remove natural noise with an LLM while preserving the semantic structure.
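A common way to score semantic similarity between a noisy sentence and its cleaned counterpart is cosine similarity over sentence embeddings; whether the paper uses exactly this metric is an assumption here. A minimal sketch with invented toy vectors (in practice the vectors would come from a sentence encoder):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings for a noisy sentence and its cleaned version
# (invented for illustration, not real encoder output).
noisy_vec = [0.90, 0.10, 0.40]
clean_vec = [0.85, 0.15, 0.42]
print(round(cosine_similarity(noisy_vec, clean_vec), 3))
```

A score near 1.0 indicates the cleaned sentence preserves the meaning of the noisy original.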
Moreover, this approach surpassed conventional data cleaning methods by not just removing undesirable sentences but by addressing language intricacies, such as emojis, slang, jargon, and profanities. Additionally, it demonstrated that cleaned data could be generated without significantly reducing the overall sample size.
Beyond its effectiveness in data cleaning, the research also unveiled another remarkable capability of LLMs: the generation of high-quality parallel data even in resource-constrained settings. “This finding has significant implications for low-resource domains and languages, where acquiring parallel corpora is often challenging,” the researchers concluded.
Authors: Quinten Bolding, Baohao Liao, Brandon James Denis, Jun Luo, and Christof Monz