Currently, most reinforcement learning tasks focus on domains such as mathematics and programming, where verification is relatively straightforward. In subjective tasks such as role-playing, however, alignment techniques struggle to make progress, primarily because subjective reward modeling with the Bradley-Terry model faces significant challenges when preferences are ambiguous. To improve reward modeling on subjective tasks, this paper proposes AAM (\textbf{\underline{A}}ct-\textbf{\underline{A}}daptive \textbf{\underline{M}}argin), which enhances reward modeling by dynamically calibrating preference margins using the model's internal parametric knowledge. We design two versions of AAM that efficiently generate contextually appropriate preference gaps without additional human annotation. This approach fundamentally improves how reward models handle subjective rewards by better integrating generative understanding with preference scoring. To validate AAM's effectiveness in subjective reward modeling, we conduct evaluations on RewardBench, JudgeBench, and challenging role-playing tasks. Results show that AAM significantly improves subjective reward modeling performance, enhancing Bradley-Terry reward models by 2.95\% on general tasks and 4.85\% on subjective role-playing tasks. Furthermore, reward models trained with AAM help downstream alignment tasks achieve better results. Our experiments show that applying rewards produced by an AAM-augmented reward model to alignment algorithms (e.g., GRPO) achieves state-of-the-art results on CharacterEval and Charm. Code and dataset are available at https://github.com/calubkk/AAM.
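To make the core idea concrete, the following is a minimal sketch of a Bradley-Terry pairwise loss with a per-pair preference margin. The abstract does not specify AAM's exact formulation, so the margin here is simply a function argument: a fixed scalar recovers the standard margin-augmented loss, whereas AAM would supply a context-dependent value derived from the model's own parametric knowledge. The function name and signature are illustrative, not taken from the released code.

```python
import math

def bt_margin_loss(r_chosen: float, r_rejected: float, margin: float = 0.0) -> float:
    """Bradley-Terry pairwise loss with a preference margin.

    Computes -log(sigmoid(r_chosen - r_rejected - margin)):
    the larger the margin, the bigger the score gap the reward
    model must produce before the pair is considered well-ranked.
    """
    z = r_chosen - r_rejected - margin
    return -math.log(1.0 / (1.0 + math.exp(-z)))
```

A dynamically calibrated margin, as in AAM, would replace the fixed `margin` with a value estimated per preference pair, so that clear-cut pairs demand a large score gap while ambiguous pairs are penalized less harshly.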