Task-oriented dialogue (TOD) models have made great progress in the past few years. However, these studies primarily focus on datasets written by annotators, which has resulted in a gap between academic research and more realistic spoken conversation scenarios. While a few small-scale spoken TOD datasets have been proposed to address robustness issues such as ASR errors, they fail to identify the unique challenges of spoken conversation. To address these limitations, we introduce SpokenWOZ, a large-scale speech-text dataset for spoken TOD, which consists of 8 domains, 203k turns, 5.7k dialogues, and 249 hours of audio from human-to-human spoken conversations. SpokenWOZ incorporates common spoken characteristics such as word-by-word processing and commonsense reasoning. We also present cross-turn slot detection and reasoning slot detection as new challenges based on these spoken linguistic phenomena. We conduct comprehensive experiments on various models, including text-modal baselines, newly proposed dual-modal baselines, and LLMs. The results show that current models, including fine-tuned models and LLMs such as ChatGPT, still have substantial room for improvement in spoken conversation.