Robust task-oriented spoken dialogue agents require exposure to the full diversity of how people interact through speech. Building spoken user simulators that capture this diversity requires large-scale spoken task-oriented dialogue (TOD) data covering spoken user behaviors, yet existing datasets are limited in scale and domain coverage, and no systematic pipeline exists for augmenting them. To address this gap, we introduce \textbf{SpokenTOD}, a spoken TOD dataset of 52,390 dialogues and 1,034 hours of speech, augmented with four spoken user behaviors (cross-turn slots, barge-in, disfluency, and emotional prosody) across diverse speakers and domains. Building on SpokenTOD, we present \textbf{SpokenUS}, a spoken user simulator grounded in TOD with a dedicated architecture for barge-in. SpokenUS achieves goal coverage comparable to that of significantly larger models while substantially outperforming all baselines in human mean opinion score (MOS); it discloses slot values gradually across the dialogue, as humans do, rather than front-loading them. Further analysis confirms that SpokenUS's spoken behaviors pose meaningful challenges to downstream agents, making it a practical tool for training and evaluating more robust spoken dialogue systems.