The rapid advancement of instruction-guided audio generation has highlighted the critical need for robust alignment evaluation. Current automated evaluation methods rely heavily on holistic scoring from general-purpose large language models, which struggle to decouple complex instructions, lack interpretability, and fail to capture fine-grained attribute mismatches. To address this, we introduce a novel dynamic rubric-based evaluation paradigm that adaptively decomposes each complex audio caption into a variable number of independent, verifiable binary rubric items. To rigorously benchmark this capability, we propose AnyAudio-Judge Bench, a comprehensive bilingual benchmark comprising 7,920 meticulously curated samples across four audio domains (speech, sound, music, and mixed), featuring deliberately constructed hard negatives. Furthermore, we construct a large-scale corpus of 105K samples with explicit Chain-of-Thought (CoT) rationales to train our dedicated evaluator, the AnyAudio-Judge model. Trained with a pipeline that combines Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO), the model aligns its reasoning paths with the rubric-based scoring mechanism. Extensive experiments demonstrate that AnyAudio-Judge not only significantly improves zero-shot alignment detection over state-of-the-art baselines, but also provides precise and interpretable reward signals that substantially improve instruction alignment in downstream reinforcement learning for audio generation.
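As a rough illustration of the rubric-based scoring idea, the sketch below aggregates binary rubric verdicts into a scalar score and then normalizes a group of such scores in the group-relative style used by GRPO. The abstract does not specify the aggregation rule or the reward shaping, so the unweighted mean, the names `RubricItem`, `rubric_score`, and `group_relative_advantages`, and the example rubric items are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class RubricItem:
    description: str  # one verifiable claim derived from the caption (hypothetical)
    satisfied: bool   # binary verdict for the evaluated audio clip


def rubric_score(items):
    """Fraction of rubric items satisfied (assumed aggregation rule)."""
    if not items:
        raise ValueError("at least one rubric item is required")
    return sum(item.satisfied for item in items) / len(items)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group, GRPO-style.

    Uses the population standard deviation; eps avoids division by zero
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric for a caption such as
# "a woman whispers over rain, and a piano enters near the end".
items = [
    RubricItem("a female voice is speaking", True),
    RubricItem("rain is audible in the background", True),
    RubricItem("the speech is whispered", False),
    RubricItem("a piano enters near the end", True),
]
score = rubric_score(items)

# Rubric scores of four generations sampled for the same instruction.
advantages = group_relative_advantages([score, 1.0, 0.5, 0.25])
```

Because each item is judged independently, a failed item (here, the whispering attribute) lowers the score by a traceable amount, which is the kind of fine-grained, interpretable signal that a single holistic score does not expose.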