Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotion, speaking rate, pitch), and phonological characteristics (e.g., prosody, intonation, rhythm), all of which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning over natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our benchmark in linguistic theory, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization. MMSU establishes a new standard for the comprehensive assessment of spoken language understanding, providing valuable insights for developing more sophisticated human-AI speech interaction systems. The MMSU benchmark is available at https://huggingface.co/datasets/ddwang2000/MMSU. The evaluation code is available at https://github.com/dingdongwang/MMSU_Bench.