Multimodal reward models are crucial for aligning multimodal large language models with human preferences. Recent works have incorporated reasoning capabilities into these models, achieving promising results. However, training such models faces two critical challenges: (1) the inherent noise in preference datasets, which degrades model performance, and (2) the inefficiency of conventional training methods, which ignore differences in sample difficulty. In this paper, we identify a strong correlation between response entropy and accuracy, indicating that entropy can serve as a reliable, unsupervised proxy for both annotation noise and sample difficulty. Based on this insight, we propose a novel Entropy-Guided Training (EGT) approach for multimodal reasoning reward models, which combines two strategies: (1) entropy-guided data curation to mitigate the impact of unreliable samples, and (2) an entropy-guided curriculum that progressively introduces more difficult examples during training. Extensive experiments across three benchmarks show that the EGT-trained model consistently outperforms state-of-the-art multimodal reward models.
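To make the two strategies concrete, the following is a minimal Python sketch of how the pipeline described above could be realized. The abstract gives no implementation details, so everything here is an assumption: the entropy definition (mean per-token predictive entropy over the response), the helper names (`mean_token_entropy`, `entropy_guided_pipeline`, `logits_fn`), and the `noise_threshold` quantile are all hypothetical.

```python
import torch
import torch.nn.functional as F


def mean_token_entropy(logits: torch.Tensor) -> float:
    """Mean per-token predictive entropy of a generated response.

    logits: (seq_len, vocab_size) tensor of the model's logits at each
    generation step. Hypothetical helper: the abstract does not specify
    the exact entropy definition, so mean token entropy is assumed.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    token_entropies = -(probs * log_probs).sum(dim=-1)  # (seq_len,)
    return token_entropies.mean().item()


def entropy_guided_pipeline(samples, logits_fn, noise_threshold=0.9):
    """Sketch of the two EGT strategies, under the assumptions above.

    samples: preference samples; logits_fn(sample) returns the
    (seq_len, vocab_size) logits for the model's response to that sample.
    noise_threshold: entropy quantile above which samples are treated as
    unreliable (an assumed hyperparameter, not given in the abstract).
    """
    scored = [(s, mean_token_entropy(logits_fn(s))) for s in samples]
    entropies = sorted(h for _, h in scored)
    cutoff = entropies[int(noise_threshold * (len(entropies) - 1))]

    # (1) Entropy-guided data curation: discard the highest-entropy
    # samples, which the entropy-accuracy correlation flags as likely
    # carrying noisy annotations.
    curated = [(s, h) for s, h in scored if h <= cutoff]

    # (2) Entropy-guided curriculum: order the remaining samples from
    # low entropy (easy) to high entropy (hard), so training can
    # progressively introduce more difficult examples.
    curated.sort(key=lambda pair: pair[1])
    return [s for s, _ in curated]
```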