The rise of music large language models (LLMs) demands robust methods for evaluating output quality, especially for distinguishing high-quality compositions from "garbage music". Curiously, we observe that the standard cross-entropy loss -- a core training metric -- often decreases when models encounter systematically corrupted music, undermining its validity as a standalone quality indicator. To investigate this paradox, we introduce a noise injection experiment, in which controlled noise signals of varying lengths are injected into musical contexts. We hypothesize that a model's loss responding positively to these perturbations -- specifically, a sharp increase (the "Peak" area) for short injections -- can serve as a proxy for its ability to discern musical integrity. Experiments with MusicGen models in the audio waveform domain confirm that music LLMs respond more strongly to local, texture-level disruptions than to global semantic corruption. Beyond exposing this bias, our results highlight a new principle: the shape of the loss curve -- rather than its absolute value -- encodes critical information about the quality of the generated content (i.e., about model behavior). We envision this profile-based evaluation as a label-free, model-intrinsic framework for assessing musical quality -- opening the door to more principled training objectives and sharper benchmarks.
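To make the probe concrete, below is a minimal sketch of the noise-injection protocol in PyTorch. Everything here is an illustrative stand-in rather than the paper's actual pipeline: `TinyCausalLM` is a randomly initialized toy model (a real run would use a MusicGen model over EnCodec audio tokens), the token sequences are synthetic, and the vocabulary and lengths are placeholder values. The point is the shape of the computation: inject a noise segment of controlled length, then compare per-position cross-entropy against the clean sequence to read off the loss profile.

```python
import torch
import torch.nn.functional as F

VOCAB = 1024     # placeholder codebook size (EnCodec-like)
SEQ_LEN = 500    # placeholder context length in tokens

class TinyCausalLM(torch.nn.Module):
    """Hypothetical stand-in for an autoregressive music LLM over discrete tokens."""
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.rnn = torch.nn.GRU(dim, dim, batch_first=True)  # causal by construction
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.head(h)  # (batch, seq_len, vocab) next-token logits

def per_token_loss(model, tokens):
    """Per-position next-token cross-entropy; returns shape (batch, seq_len - 1)."""
    with torch.no_grad():
        logits = model(tokens[:, :-1])
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
        reduction="none",
    ).view(tokens.size(0), -1)

def inject_noise(tokens, start, length, vocab):
    """Overwrite `length` positions from `start` with uniform random tokens."""
    corrupted = tokens.clone()
    corrupted[:, start:start + length] = torch.randint(vocab, (tokens.size(0), length))
    return corrupted

model = TinyCausalLM(VOCAB).eval()
clean = torch.randint(VOCAB, (1, SEQ_LEN))
for length in (5, 50, 200):  # short vs. long injections
    corrupted = inject_noise(clean, start=100, length=length, vocab=VOCAB)
    delta = per_token_loss(model, corrupted) - per_token_loss(model, clean)
    # A "Peak"-style diagnostic: the loss spike at the corruption boundary
    # versus the average shift over the whole sequence.
    print(f"len={length:3d}  max dLoss={delta.max().item():.3f}  "
          f"mean dLoss={delta.mean().item():.3f}")
```

Under the paper's hypothesis, a model sensitive to musical integrity should show a sharp per-position spike (a large max relative to the mean shift) for short injections, whereas a flat or decreasing profile under corruption would reproduce the paradox described above.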