Evaluating text-to-music (TTM) systems remains expensive because music impression (MI) and text alignment (TA) are measured with human mean opinion scores (MOS). Most automatic MOS estimators are trained with point-wise regression or distributional classification. These objectives do not directly optimize rank-based metrics, and they provide only weak geometric constraints on cross-modal coherence. To address these gaps, we propose DeRA-MOS, a decoupled optimization framework for TTM evaluation. For MI, we introduce a batch-aware listwise ranking loss that models the relative order of samples within each mini-batch and therefore aligns better with evaluation based on Spearman's rank correlation coefficient (SRCC). For TA, we introduce a score-anchored modality alignment loss that maps human scores to a target audio-text similarity and regularizes the latent space before fusion. Experiments on MusicEval show that, by mitigating the point-wise training mismatch and modality drift, the decoupled framework substantially improves both MI and TA ranking metrics, offering a robust approach to large-scale TTM evaluation.
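To make the two objectives concrete, the following is a minimal pure-Python sketch, not the formulation used in DeRA-MOS, whose exact form the abstract does not give. It assumes a ListNet-style top-1 cross-entropy for the batch-aware listwise ranking loss, and a squared error between audio-text cosine similarity and a linearly rescaled MOS for the score-anchored alignment loss. The function names, the `temperature` parameter, and the 1-to-5 MOS range are all hypothetical choices made for illustration.

```python
import math


def _log_softmax(xs, temperature):
    # Numerically stable log-softmax over a list of floats.
    scaled = [x / temperature for x in xs]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def listwise_rank_loss(pred, mos, temperature=1.0):
    """Assumed ListNet-style loss: cross-entropy between the softmax of the
    human scores and the softmax of the predictions within one mini-batch."""
    log_p_pred = _log_softmax(pred, temperature)
    p_true = [math.exp(lp) for lp in _log_softmax(mos, temperature)]
    return -sum(t * lp for t, lp in zip(p_true, log_p_pred))


def cosine(u, v):
    # Cosine similarity with a small epsilon to avoid division by zero.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def score_anchored_alignment_loss(audio_embs, text_embs, ta_mos,
                                  lo=1.0, hi=5.0):
    """Assumed form: map each TA score linearly from [lo, hi] to a target
    similarity in [0, 1], then penalize the squared gap to the actual
    audio-text cosine similarity."""
    total = 0.0
    for a, t, s in zip(audio_embs, text_embs, ta_mos):
        target = (s - lo) / (hi - lo)  # hypothetical linear anchor
        total += (cosine(a, t) - target) ** 2
    return total / len(ta_mos)


# Toy batch: predictions in the human order score lower (better) than
# predictions in the reversed order.
mos = [1.0, 2.0, 3.0, 4.0]
loss_ordered = listwise_rank_loss([1.0, 2.0, 3.0, 4.0], mos)
loss_reversed = listwise_rank_loss([4.0, 3.0, 2.0, 1.0], mos)

# Aligned embeddings match a top TA score; orthogonal ones match the lowest.
loss_match = score_anchored_alignment_loss(
    [[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [5.0, 1.0])
loss_mismatch = score_anchored_alignment_loss(
    [[1.0, 0.0]], [[1.0, 0.0]], [1.0])
```

The listwise term depends on every sample in the batch at once, which is what distinguishes it from a point-wise regression loss. In this sketch the alignment term acts only on the embeddings before any fusion step, mirroring the decoupling described above.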