Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger

Source: arXiv. Comments: Fix title typo; update main figure to render properly on non-Chrome browsers.

In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too complex to be discovered via exploration. In this paper, we study whether Large Language Model (LLM) assistants that find easily discovered forms of specification gaming will generalize to perform rarer and more blatant forms, up to and including reward-tampering. We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on the remaining environments. Strikingly, a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function. Retraining an LLM not to game early-curriculum environments mitigates, but does not eliminate, reward-tampering in later environments. Moreover, adding harmlessness training to our gameable environments does not prevent reward-tampering. These results demonstrate that LLMs can generalize from common forms of specification gaming to more pernicious reward-tampering and that such behavior may be nontrivial to remove.
