Generative speech technologies are progressing rapidly, but evaluating the perceptual quality of synthetic speech remains a core challenge. Existing methods typically rely on scalar scores or binary decisions, which offer limited interpretability and generalize poorly across tasks and languages. We present SpeechLLM-as-Judges, a new paradigm that enables large language models (LLMs) to conduct structured, explanation-based speech quality evaluation. To support this direction, we introduce SpeechEval, a large-scale dataset containing 32,207 multilingual speech clips and 128,754 annotations spanning four tasks: quality assessment, pairwise comparison, improvement suggestion, and deepfake detection. Building on this resource, we develop SQ-LLM, a speech-quality-aware LLM trained with chain-of-thought reasoning and reward optimization to strengthen its evaluation capability. Experimental results show that SQ-LLM delivers strong performance across tasks and languages, demonstrating the potential of this paradigm for advancing speech quality evaluation. Relevant resources will be open-sourced.