While music generation models have evolved to handle complex multimodal inputs that mix text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgments of musicality and alignment on CMI-Pref and on prior datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments but also enables effective inference-time scaling via top-k filtering. The training data, benchmarks, and reward models are publicly available.
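To make the inference-time scaling claim concrete, the following is a minimal sketch of top-k filtering as best-of-n reranking with a reward model: sample several candidate generations for one instruction, score each with the reward model, and keep the top-scoring candidates. The function names `generate` and `reward_model.score` are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of inference-time scaling via top-k filtering.
# `generate` and `reward_model.score` are assumed interfaces for
# illustration; they do not name the paper's released code.

def top_k_filter(instruction, generate, reward_model,
                 n_candidates: int = 16, k: int = 1):
    """Sample n_candidates generations for one CMI instruction and
    return the k candidates the reward model scores highest."""
    # Draw multiple candidate pieces of music for the same instruction.
    candidates = [generate(instruction) for _ in range(n_candidates)]
    # Rank candidates by reward score, highest first.
    ranked = sorted(candidates,
                    key=lambda audio: reward_model.score(instruction, audio),
                    reverse=True)
    return ranked[:k]
```

With k=1 this reduces to best-of-n selection; larger n trades extra generation compute for higher expected reward, which is the scaling behavior the abstract refers to.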