With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently needs benchmarks that move beyond simple interactions to address real-world complexity. However, current evaluations largely adhere to text-generation standards, overlooking the audio-centric characteristics of paralinguistics and colloquial speech, as well as the cognitive depth required of modern agents. To bridge this gap, we introduce WavBench, a comprehensive benchmark designed to evaluate the realistic conversational abilities that prior work falls short of capturing. Uniquely, WavBench establishes a tripartite framework: 1) a Pro subset, designed to rigorously challenge reasoning-enhanced models with significantly increased difficulty; 2) a Basic subset, which defines a new standard for colloquial spoken language, prioritizing "listenability" through natural vocabulary, linguistic fluency, and interactive rapport rather than rigid written-style accuracy; and 3) an Acoustic subset, covering explicit understanding, generation, and implicit dialogue to comprehensively evaluate paralinguistic capabilities in authentic real-world scenarios. Through evaluation of five state-of-the-art models, WavBench offers critical insights into the intersection of complex problem-solving, colloquial delivery, and paralinguistic fidelity, guiding the development of robust spoken dialogue models. The benchmark dataset and evaluation toolkit are available at https://naruto-2024.github.io/wavbench.github.io/.