The reward model (RM), as the core component of reinforcement learning from human feedback (RLHF) for large language models (LLMs), is responsible for providing reward signals for generated responses. However, mainstream discriminative reward modeling is inadequate in terms of token-level interaction, making its judgment signals vulnerable to hacking by misallocated attention to context. This stems from two fundamental limitations: (1) current preference modeling employs decoder-only architectures, whose unidirectional causal attention mechanism leads to forward-decaying intra-sequence attention within the prompt-response sequence; and (2) the independent Siamese-encoding paradigm induces the absence of token-level inter-sequence attention between chosen and rejected sequences. To address this "attention hacking", we propose "Interaction Distillation", a novel training framework for more adequate discriminative reward modeling via attention-level optimization. The method introduces an interaction-based natural language understanding model as the teacher to provide sophisticated token-interaction patterns via comprehensive attention, and guides the reward model to simulate the teacher's interaction patterns through an attentional alignment objective. Extensive experiments demonstrate that interaction distillation provides more stable and generalizable reward signals than state-of-the-art RM optimization methods targeting data noise, highlighting that attention hacking constitutes a more fundamental limitation of discriminative RMs.
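The abstract does not specify the form of the attentional alignment objective. A minimal sketch of one plausible instantiation, assuming the objective is a KL divergence between the teacher's and student's attention distributions (the function name, head-averaging choice, and shapes are illustrative assumptions, not the paper's method):

```python
import torch

def attention_alignment_loss(student_attn, teacher_attn, eps=1e-8):
    """KL(teacher || student) between attention distributions.

    Both tensors have shape (batch, heads, seq, seq); each row is an
    attention distribution over key positions (sums to 1 on the last dim).
    """
    # Average over heads so teacher and student with different head
    # counts can still be aligned (a simplifying assumption).
    s = student_attn.mean(dim=1).clamp_min(eps)
    t = teacher_attn.mean(dim=1).clamp_min(eps)
    # KL divergence, averaged over batch and query positions.
    return (t * (t.log() - s.log())).sum(dim=-1).mean()

# Toy usage: random attention maps, normalized along the key axis.
torch.manual_seed(0)
student = torch.softmax(torch.randn(2, 8, 5, 5), dim=-1)  # 8 heads
teacher = torch.softmax(torch.randn(2, 4, 5, 5), dim=-1)  # 4 heads
loss = attention_alignment_loss(student, teacher)
```

In practice such a term would be added to the standard pairwise preference loss, with a weight controlling how strongly the RM's attention is pulled toward the teacher's interaction pattern.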