CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify evaluation, we propose CMI-RewardBench, a benchmark that evaluates music reward models on heterogeneous samples across three dimensions: musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate how well their scores correlate with human judgments of musicality and alignment on CMI-Pref and on previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering. Code is available at GitHub (https://github.com/Haiwen-Xia/CMI-RewardBench). Model weights: CMI-RM (https://huggingface.co/HaiwenXia/CMI-RM). Datasets: CMI-Pref-Pseudo (https://huggingface.co/datasets/HaiwenXia/cmi-pref-pseudo) and CMI-Pref (https://huggingface.co/datasets/HaiwenXia/cmi-pref).
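The abstract does not spell out how top-k filtering is applied, so the following is only a minimal sketch of the general idea: generate several candidates, score each with a reward model, and keep the k highest-scoring ones. The names `top_k_filter` and `score_fn` are hypothetical and are not taken from the CMI-RewardBench code; `score_fn` stands in for whatever callable wraps a reward model such as CMI-RM.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def top_k_filter(
    candidates: Sequence[T],
    score_fn: Callable[[T], float],  # hypothetical wrapper around a reward model
    k: int,
) -> List[Tuple[T, float]]:
    """Return the k candidates with the highest reward, best first."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Score every candidate once, then rank by score in descending order.
    scored = [(candidate, score_fn(candidate)) for candidate in candidates]
    # list.sort is stable, so tied candidates keep their original order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy stand-in: candidates are clip names, rewards come from a lookup table.
    toy_rewards = {"clip_a": 0.2, "clip_b": 0.9, "clip_c": 0.5}
    print(top_k_filter(list(toy_rewards), toy_rewards.__getitem__, k=2))
```

In this setup, drawing more candidates spends extra inference-time compute, and the reward model decides which outputs survive; whether the authors use exactly this selection rule is an assumption.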

