We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) votes, metric-alignment preference pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. The predicted score margin between two clips is well calibrated on our held-out test split, supporting data filtering via a simple score threshold. TuneJury generalizes to both held-out test pairs and out-of-distribution benchmarks, remaining competitive with prior baselines on the latter. For generators released after training, we introduce anchor calibration, a post-hoc, per-system Bradley-Terry calibration that recovers agreement at substantially better data efficiency than from-scratch retraining. The same frozen reward drives consistent reward-axis gains across three downstream applications: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available at https://github.com/yonghyunk1m/TuneJury.
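As a minimal sketch of how a scalar reward of this kind could be used for pairwise preference, best-of-N selection, and threshold filtering, consider the Python below. It is not the released TuneJury API: the `reward` callable, the function names, and the toy scores are all hypothetical, and the logistic (Bradley-Terry) link with unit scale is an assumption rather than the authors' calibrated mapping.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: reward(prompt, clip) -> scalar preference score.
# A real clip would be an audio array or path; any object works here.
Reward = Callable[[str, object], float]


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that A is preferred over B.

    Assumes a logistic link on the raw score margin (unit scale).
    Written in a numerically stable form for large margins.
    """
    margin = score_a - score_b
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def best_of_n(prompt: str, candidates: Sequence[object],
              reward: Reward) -> Tuple[object, float]:
    """Score every candidate and return the highest-scoring one."""
    scores = [reward(prompt, clip) for clip in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best], scores[best]


def keep_above_threshold(prompt: str, candidates: Sequence[object],
                         reward: Reward, threshold: float) -> List[object]:
    """Data filtering: keep clips whose score is at least `threshold`."""
    return [clip for clip in candidates if reward(prompt, clip) >= threshold]


if __name__ == "__main__":
    # Toy stand-in for a reward model: fixed, made-up scores per clip name.
    toy_scores = {"clip_a": 0.1, "clip_b": 0.9, "clip_c": 0.4}
    toy_reward = lambda prompt, clip: toy_scores[clip]

    clips = list(toy_scores)
    print(best_of_n("calm piano", clips, toy_reward))
    print(keep_above_threshold("calm piano", clips, toy_reward, 0.4))
    print(preference_probability(toy_scores["clip_b"], toy_scores["clip_a"]))
```

In practice the threshold and the scale of the score margin would be chosen from the calibration results on a held-out split rather than fixed by hand as in this toy example.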