Researchers from Google, the University of Rochester, the University of California, and Columbia University have introduced a dataset of over 550K multilingual conversations between humans and virtual assistants in various contexts, enabling more realistic training data for improving language model performance. Google also announced the dataset in a blog post.
With the wide adoption of virtual assistants such as Google Assistant, Alexa, and Siri, researchers have taken an interest in the study of task-oriented dialogue; however, the lack of datasets that capture a wide range of user pain points has limited the impact of academic research in this field.
Although some custom datasets have been created, they lack the typical speech phenomena necessary for model training, leading to underperforming models and dissatisfaction with assistant interactions.
The new dataset, coined PRESTO and released on March 17, 2023, spans six languages (German, English, Spanish, French, Hindi, and Japanese) and contains a diverse array of challenges that occur in real-world natural language understanding (NLU) tasks, including disfluencies (e.g., repeated phrases and filler words), code-switching or code-mixing (i.e., switching between or mixing words from two languages), and user revisions (i.e., revising requests due to mistakes, or changing or canceling requests).
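The three phenomena can be illustrated with short hypothetical utterances. These examples are illustrative only and are not drawn from the PRESTO dataset itself:

```python
# Hypothetical examples of the three phenomena PRESTO targets
# (illustrative only; not actual utterances from the dataset).
examples = [
    {
        "phenomenon": "disfluency",
        "utterance": "Set a, uh, set a timer for ten minutes",
        "note": "repeated phrase ('set a') plus a filler word ('uh')",
    },
    {
        "phenomenon": "code-switching",
        "utterance": "Play mera favourite song on the speaker",
        "note": "the Hindi word 'mera' ('my') mixed into an English request",
    },
    {
        "phenomenon": "user revision",
        "utterance": "Call Mom -- no wait, call Dad instead",
        "note": "the user corrects the request mid-utterance",
    },
]

for ex in examples:
    print(f"{ex['phenomenon']}: {ex['utterance']}")
```

Models trained only on clean, scripted utterances tend to misparse exactly these patterns, which is why the dataset includes them deliberately.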
Conversations by Native Speakers Only
What sets PRESTO apart from other datasets is that it only includes conversations provided by native speakers of the language with no translation. As the authors of the research paper introducing the dataset explain, prior large multilingual datasets contain non-English conversations obtained by translating English conversations into other languages, “resulting in unnatural and synthetic utterances which are unlikely to be spoken by native speakers of the non-English language.”
A typical user interacts with virtual assistants in a virtual world (i.e., context) that may contain structured objects, such as a list of contacts on the user’s phone, a shopping list, or a to-do list. According to the authors, PRESTO “is the only large-scale human generated conversational parsing dataset that provides structured context such as a user’s contacts and lists for each example.”
They explained that, depending on the query, this context may or may not be needed to correctly interpret the user’s utterances. Semantic parsing models often struggle to determine which part of the context is relevant to a given utterance (if any). Therefore, the authors emphasized that “modeling solutions should have the ability to model (and ignore) such structured information.”
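One simple way to expose such structured context to a parser is to serialize it alongside the utterance and let the model learn when to use it and when to ignore it. The sketch below is our own illustration; the field names and serialization format are assumptions, not PRESTO's actual schema:

```python
# Sketch: pairing an utterance with structured context for a semantic parser.
# The schema and markers below are illustrative, not PRESTO's actual format.

def serialize_context(context: dict) -> str:
    """Flatten structured context (contacts, lists) into a text prefix."""
    parts = []
    for key, values in context.items():
        parts.append(f"{key}: {', '.join(values)}")
    return " | ".join(parts)

def build_model_input(utterance: str, context: dict) -> str:
    """Concatenate serialized context and the utterance into one model input."""
    return f"[context] {serialize_context(context)} [utterance] {utterance}"

context = {
    "contacts": ["Alice", "Bob"],
    "shopping_list": ["milk", "eggs"],
}

# Context-dependent query: resolving "Alice" requires the contact list.
print(build_model_input("Call Alice", context))

# Context-independent query: the model should learn to ignore the context.
print(build_model_input("Set a timer for 5 minutes", context))
```

Both inputs carry the same context block, which is the point: the parser must learn that it matters for the first request and is irrelevant to the second.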
Realistic and Complex Utterances
The release of this dataset highlights the need for realistic and complex utterances to improve the performance of virtual assistants and provides researchers with a tool to explore new models and algorithms that can better handle the challenges associated with task-oriented dialogues. Overall, the creation of PRESTO is a significant step forward in the advancement of natural language processing (NLP) and the development of virtual assistants, according to the authors.
“With the release of this dataset, we open more questions than we answer, and we hope the research community makes progress on utterances that are more in line with what users are facing every day,” they said.
Authors: Rahul Goel, Waleed Ammar, Aditya Gupta, Siddharth Vashishtha, Motoki Sano, Faiz Surani, Max Chang, Hyun Jeong Choe, David Greene, Kyle He, Rattima Nitisaroj, Anna Trukhina, Shachi Paul, Pararth Shah, Rushin Shah, Zhou Yu