Research interest in task-oriented dialogs has increased as systems such as Google Assistant, Alexa, and Siri have become ubiquitous in everyday life. However, the impact of academic research in this area has been limited by the lack of datasets that realistically capture the wide array of user pain points. To enable research on some of the more challenging aspects of parsing realistic conversations, we introduce PRESTO, a public dataset of over 550K contextual multilingual conversations between humans and virtual assistants. PRESTO contains a diverse array of challenges that occur in real-world NLU tasks, such as disfluencies, code-switching, and revisions. It is the only large-scale, human-generated conversational parsing dataset that provides structured context, such as a user's contacts and lists, for each example. Our mT5-based baselines demonstrate that the conversational phenomena present in PRESTO are challenging to model, and that the difficulty is more pronounced in a low-resource setup.