Automatic mean opinion score (MOS) prediction offers a principled alternative to both subjective listening tests and objective metrics, providing scalable and consistent audio evaluation. In line with the LLM-as-Judge paradigm, recent multimodal large language models offer strong perceptual modeling and reasoning capabilities that enable audio quality assessment. In this work, we address the challenging problem of audio editing evaluation and propose the first natural-language-based automated evaluation framework, built upon Qwen2-Audio. We introduce two caption-based fine-tuning tasks to enhance multi-audio understanding, together with a tailored Chain-of-Thought prompting strategy that encourages structured, step-by-step reasoning. Experiments show that our framework produces interpretable and logically consistent text-based evaluations that align closely with human judgments and outperform existing baselines. The code and demo are available at https://github.com/NKU-HLT/Eval_Reasoning.