Reward modeling (RM), which captures human preferences to align large language models (LLMs), is increasingly used in tasks such as model fine-tuning, response filtering, and ranking. However, because human preferences are inherently complex and available datasets have limited coverage, reward models often fail under distribution shifts or adversarial perturbations. Existing approaches for identifying such failure modes typically rely on prior knowledge of preference distributions or failure attributes, which limits their practicality in real-world settings where such information is unavailable. In this work, we propose a tractable, preference-distribution-agnostic method for discovering reward model failure modes via reward-guided controlled decoding. Building on this, we introduce REFORM, a self-improving reward modeling framework that enhances robustness by using the reward model itself to guide the generation of falsely scored responses. These adversarial examples are then used to augment the training data and patch the reward model's misaligned behavior. We evaluate REFORM on two widely used preference datasets, Anthropic Helpful Harmless (HH) and PKU Beavertails, and demonstrate that it significantly improves robustness without sacrificing reward quality. Notably, REFORM preserves performance both in direct evaluation and in downstream policy training, and further improves alignment quality by removing spurious correlations.
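To make the idea of reward-guided controlled decoding concrete, the following is a minimal toy sketch, not the REFORM implementation. It assumes a hypothetical policy that returns next-token log-probabilities and a hypothetical reward model that scores partial responses; at each step the decoder restricts itself to the policy's top-k tokens and picks the one the reward model scores highest (or lowest). The names `reward_guided_decode`, `toy_policy`, `toy_reward`, and `augment`, as well as the structure of the augmented preference pair, are illustrative assumptions that the abstract does not specify.

```python
from typing import Callable, Dict, List, Sequence

Token = str
# Hypothetical interfaces (assumptions for this sketch):
# a policy maps a context to next-token log-probabilities,
# a reward model maps a (partial) response to a scalar score.
Policy = Callable[[Sequence[Token]], Dict[Token, float]]
RewardModel = Callable[[Sequence[Token]], float]


def reward_guided_decode(
    policy: Policy,
    reward_model: RewardModel,
    prompt: Sequence[Token],
    max_new_tokens: int = 5,
    top_k: int = 2,
    maximize: bool = True,
    eos: Token = "<eos>",
) -> List[Token]:
    """Greedy decoding over the policy's top-k tokens, steered by the reward model."""
    response: List[Token] = []
    for _ in range(max_new_tokens):
        logprobs = policy(list(prompt) + response)
        # Keep only tokens the policy itself considers likely, so the output stays fluent.
        candidates = sorted(logprobs, key=lambda t: logprobs[t], reverse=True)[:top_k]

        def score(tok: Token) -> float:
            # Reward of the partial response extended by one candidate token.
            return reward_model(response + [tok])

        best = max(candidates, key=score) if maximize else min(candidates, key=score)
        if best == eos:
            break
        response.append(best)
    return response


def toy_policy(context: Sequence[Token]) -> Dict[Token, float]:
    # Context-independent toy distribution (illustrative values only).
    return {"sure": -0.5, "sorry": -1.0, "here": -1.5, "<eos>": -3.0}


def toy_reward(response: Sequence[Token]) -> float:
    # Toy reward with a spurious correlation: it simply counts the token "sorry".
    return float(sum(tok == "sorry" for tok in response))


def augment(dataset: list, prompt, reference, adversarial) -> None:
    # Assumed augmentation format: the adversarial response becomes the rejected side.
    dataset.append(
        {"prompt": list(prompt), "chosen": list(reference), "rejected": list(adversarial)}
    )
```

With these toy components, `reward_guided_decode(toy_policy, toy_reward, ["hi"], max_new_tokens=3)` returns `["sorry", "sorry", "sorry"]`: a degenerate response that the toy reward model scores highly, which is the kind of falsely scored example one would then feed back through `augment` as a rejected response.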