Task-oriented dialogue (TOD) models have made significant progress in recent years. However, previous studies have focused primarily on datasets written by annotators, leaving a gap between academic research and real-world spoken conversation. Although several small-scale spoken TOD datasets have been proposed to address robustness issues such as ASR errors, they ignore the unique challenges of spoken conversation. To address these limitations, we introduce SpokenWOZ, a large-scale speech-text dataset for spoken TOD, containing 8 domains, 203k turns, 5.7k dialogues, and 249 hours of audio from human-to-human spoken conversations. SpokenWOZ further incorporates common spoken characteristics such as word-by-word processing and reasoning in spoken language. Based on these characteristics, we present cross-turn slot detection and reasoning slot detection as new challenges. We conduct experiments on various baselines, including text-modal models, newly proposed dual-modal models, and LLMs such as ChatGPT. The results show that current models still have substantial room for improvement on spoken conversation: the most advanced dialogue state tracker achieves only 25.65% joint goal accuracy, and the SOTA end-to-end model correctly completes the user request in only 52.1% of dialogues. The dataset, code, and leaderboard are available at https://spokenwoz.github.io/SpokenWOZ-github.io/.