Paralinguistic cues are essential for natural human-computer interaction, yet their evaluation in Large Audio-Language Models (LALMs) remains limited by coarse feature coverage and the inherent subjectivity of assessment. To address these challenges, we introduce SpeechParaling-Bench, a comprehensive benchmark for paralinguistic-aware speech generation. It expands existing coverage from fewer than 50 to over 100 fine-grained features, supported by more than 1,000 English-Chinese parallel speech queries, and is organized into three progressively challenging tasks: fine-grained control, intra-utterance variation, and context-aware adaptation. To enable reliable evaluation, we further develop a pairwise comparison pipeline, in which candidate responses are evaluated against a fixed baseline by an LALM-based judge. By framing evaluation as relative preference rather than absolute scoring, this approach mitigates subjectivity and yields more stable and scalable assessments without costly human annotation. Extensive experiments reveal substantial limitations in current LALMs. Even leading proprietary models struggle with comprehensive static control and dynamic modulation of paralinguistic features, while failure to correctly interpret paralinguistic cues accounts for 43.3% of errors in situational dialogue. These findings underscore the need for more robust paralinguistic modeling toward human-aligned voice assistants.
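To make the pairwise protocol concrete, the following is a minimal sketch of how per-query preference verdicts against a fixed baseline might be aggregated into a single win rate. It is not the benchmark's actual pipeline. The `Judge` signature, the three verdict labels, the half-credit treatment of ties, and the toy `length_judge` are all assumptions introduced for illustration. In practice the judge would be an LALM comparing two audio responses.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

# Hypothetical judge interface (an assumption, not the paper's API):
# (query, candidate_response, baseline_response) -> "candidate" | "baseline" | "tie"
Judge = Callable[[str, str, str], str]

VALID_VERDICTS = {"candidate", "baseline", "tie"}


def pairwise_win_rate(items: Iterable[Tuple[str, str, str]], judge: Judge) -> float:
    """Return the candidate's win rate against a fixed baseline.

    Each item is (query, candidate_response, baseline_response).
    Ties are counted as half a win, which is an assumption of this sketch.
    """
    counts: Counter = Counter()
    for query, candidate, baseline in items:
        verdict = judge(query, candidate, baseline)
        if verdict not in VALID_VERDICTS:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        counts[verdict] += 1

    total = sum(counts.values())
    if total == 0:
        raise ValueError("no items to evaluate")
    return (counts["candidate"] + 0.5 * counts["tie"]) / total


def length_judge(query: str, candidate: str, baseline: str) -> str:
    """Toy stand-in for an LALM judge: prefers the longer response."""
    if len(candidate) > len(baseline):
        return "candidate"
    if len(candidate) < len(baseline):
        return "baseline"
    return "tie"
```

Because every candidate model is compared against the same baseline, the resulting win rates are directly comparable across models, which is the property the relative-preference framing relies on.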