This paper explores a novel perspective on speech quality assessment by leveraging natural language descriptions, offering richer, more nuanced insights than traditional numerical scoring methods. Natural language feedback provides instructive recommendations and detailed evaluations, yet existing datasets lack the comprehensive annotations needed for this approach. To bridge this gap, we introduce QualiSpeech, a comprehensive low-level speech quality assessment dataset encompassing 11 key aspects and detailed natural language comments that include reasoning and contextual insights. Additionally, we propose the QualiSpeech Benchmark to evaluate the low-level speech understanding capabilities of auditory large language models (LLMs). Experimental results demonstrate that fine-tuned auditory LLMs can reliably generate detailed descriptions of noise and distortion, effectively identifying their types and temporal characteristics. The results further highlight the potential of incorporating reasoning to enhance the accuracy and reliability of quality assessments. The dataset will be released at https://huggingface.co/datasets/tsinghua-ee/QualiSpeech.