As AI systems become more intelligent and their behavior becomes more challenging to assess, they may learn to game the flaws of human feedback instead of genuinely striving to follow instructions; however, this risk can be mitigated by controlling how LLMs generalize human feedback to situations where it is unreliable. To better understand how reward models generalize, we craft 69 distribution shifts spanning 8 categories. We find that reward models do not learn to evaluate "instruction-following" by default and instead favor personas that resemble internet text. Techniques for interpreting reward models' internal representations achieve better generalization than standard fine-tuning, but still frequently fail to distinguish instruction-following from conflated behaviors. We consolidate the 15 most challenging distribution shifts into the GENeralization analogIES (GENIES) benchmark, which we hope will enable progress toward controlling reward model generalization.