While Speech Large Language Models (Speech-LLMs) show strong performance in many applications, their robustness remains critically under-tested, especially against speech disfluency. Existing evaluations often rely on idealized inputs, overlooking common disfluencies, particularly those associated with conditions such as Parkinson's disease. This work investigates whether current Speech-LLMs can maintain performance when interacting with users who have speech impairments. To facilitate this inquiry, we introduce VocalBench-DF, a framework for the systematic evaluation of disfluency across a multi-dimensional taxonomy. Our evaluation of 22 mainstream Speech-LLMs reveals substantial performance degradation, indicating that their real-world readiness is limited. Further analysis identifies phoneme-level processing and long-context modeling as the primary bottlenecks behind these failures. Strengthening recognition and reasoning capabilities at both the component and pipeline levels can substantially improve robustness. These findings highlight the urgent need for new methods to improve disfluency handling and to build truly inclusive Speech-LLMs.