Research interest in task-oriented dialogs has increased as systems such as Google Assistant, Alexa, and Siri have become ubiquitous in everyday life. However, the impact of academic research in this area has been limited by the lack of datasets that realistically capture the wide array of user pain points. To enable research on some of the more challenging aspects of parsing realistic conversations, we introduce PRESTO, a public dataset of over 550K contextual multilingual conversations between humans and virtual assistants. PRESTO contains a diverse array of challenges that occur in real-world NLU tasks, such as disfluencies, code-switching, and revisions. It is the only large-scale, human-generated conversational parsing dataset that provides structured context, such as a user's contacts and lists, for each example. Our mT5-based baselines demonstrate that the conversational phenomena present in PRESTO are challenging to model, and that this difficulty is more pronounced in a low-resource setup.