Driven by the rapid advancement of Large Language Models (LLMs), particularly Audio-LLMs and Omni-models, spoken dialogue systems have evolved significantly, progressively narrowing the gap between human-machine and human-human interaction. Achieving truly ``human-like'' communication requires a dual capability: emotional intelligence to perceive and resonate with users' emotional states, and robust interaction mechanisms to navigate the dynamic, natural flow of conversation, such as real-time turn-taking. We therefore launched the first Human-like Spoken Dialogue Systems Challenge (HumDial) at ICASSP 2026 to benchmark these dual capabilities. Anchored by a sizable dataset derived from authentic human conversations, this initiative establishes a fair evaluation platform across two tracks: (1) Emotional Intelligence, targeting long-term emotion understanding and empathetic generation; and (2) Full-Duplex Interaction, systematically evaluating real-time decision-making under ``listening-while-speaking'' conditions. This paper summarizes the dataset, track configurations, and the final results.