Multimodal Reward Models (MM-RMs) are crucial for aligning Large Language Models (LLMs) with human preferences, particularly as LLMs increasingly interact with multimodal data. However, we find that MM-RMs trained on existing datasets often fail to generalize to out-of-distribution data because they rely on unimodal spurious correlations, primarily text-only shortcuts present in the training distribution, which prevents them from learning the true multimodal reward function. To address this, we introduce a Shortcut-aware MM-RM learning algorithm that mitigates this issue by dynamically reweighting training samples, shifting the training distribution toward samples that require genuine multimodal understanding and reducing dependence on unimodal shortcuts. Our experiments demonstrate significant improvements in generalization, downstream task performance, and scalability, establishing a more robust framework for multimodal reward modeling.
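To make the dynamic reweighting idea concrete, below is a minimal sketch of a shortcut-aware pairwise reward-modeling loss. It is an illustration under stated assumptions, not the paper's actual implementation: the model interface `rm(image, text)`, the `image=None` convention for text-only scoring, the batch field names, and the specific weighting function are all hypothetical. The core pattern, down-weighting preference pairs that a text-only shortcut already ranks correctly so that training mass shifts toward samples requiring the image, reflects the mechanism described above.

```python
import torch
import torch.nn.functional as F

def shortcut_aware_loss(rm, batch):
    """Pairwise Bradley-Terry loss with shortcut-aware sample reweighting.

    Assumes a hypothetical reward model `rm(image, text) -> tensor of
    scalar rewards`, where passing image=None scores a text-only
    (unimodal) forward pass. `batch` holds "image", "chosen", and
    "rejected" fields (illustrative names).
    """
    # Multimodal reward margin between chosen and rejected responses.
    r_chosen = rm(batch["image"], batch["chosen"])
    r_rejected = rm(batch["image"], batch["rejected"])
    margin_mm = r_chosen - r_rejected

    # Text-only margin: how well the pair is ranked without the image.
    # No gradient here; it only serves to diagnose the shortcut.
    with torch.no_grad():
        t_chosen = rm(None, batch["chosen"])
        t_rejected = rm(None, batch["rejected"])
        margin_text = t_chosen - t_rejected

        # Down-weight pairs the text-only shortcut already solves,
        # shifting training mass toward image-dependent samples.
        weight = 1.0 - torch.sigmoid(margin_text)

    # Standard Bradley-Terry objective, reweighted per sample.
    return -(weight * F.logsigmoid(margin_mm)).sum() / weight.sum()
```

One design note on this sketch: computing the text-only margin without gradients keeps the shortcut branch as a pure diagnostic, so the reweighting changes the effective training distribution without directly optimizing the unimodal pathway.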