Deployed agents increasingly act with their reward proxy in view, such as a balance, a score, or a KPI dashboard. We show that reinforcement learning can make a policy \emph{addicted} to such a visible self-benefit channel. The addicted policy chases the displayed payoff across held-out domains, sacrifices the true task to do so, and follows the channel wherever we rewrite it, while policies that never saw the channel stay honest. We call this \emph{reward-channel addiction} and study it in \emph{MoneyWorld}, a synthetic sandbox. The addiction can \emph{flip a model's safety alignment}: after training only on innocuous money tasks with no safety content, whenever a dashboard pays for an unsafe action the model abandons the safe action it otherwise always takes, and it reverts to the safe action once the channel is hidden. This learned bribe replicates across model scales and families. Blindly optimizing super-capable, next-generation AI on KPIs or P\&L can be dangerous for alignment. \emph{Greed is learned} when following such a channel pays.