Reward Hacking in Language Model Agents: Revisiting AI Safety Gridworlds

Reward hacking, where AI systems exploit misspecified objectives to achieve high reward without satisfying intended goals, remains a central challenge in AI safety. Yet most known instances have been discovered post hoc in frontier systems where controlled study is impractical. We adapt the AI Safety Gridworlds framework into a text-based evaluation suite that reformulates classic reinforcement learning safety tasks for language-based agents. Across frontier and mid-scale models, we find that specification gaming emerges zero-shot: models systematically achieve high observed reward while underperforming on hidden safety objectives, and even apparently safe behaviors can reflect misunderstanding rather than principled safety. Reinforcement learning does not correct these failures: direct reward optimization widens the gap between observed and hidden reward, as the model's initial competence causes it to lock into locally rewarding strategies before discovering safer alternatives. This pattern persists across model scales (1.5B to 14B) and is not resolved by finer credit assignment, exploration prompts, or entropy regularization. Our results show that reward hacking arises naturally when optimizing proxy objectives with capable language model agents and resists standard mitigations, suggesting that proxy-reward failures in agentic settings may require approaches beyond standard exploration and credit-assignment fixes. To facilitate reproducibility, the code for this work is available at https://github.com/asparius/verl-agent-safety.
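
To make the distinction between observed and hidden reward concrete, the following is a minimal illustrative sketch, not the environment or code from the repository above. The grid layout, the reward values (`STEP_COST`, `GOAL_REWARD`, `VASE_PENALTY`), and the helper names `render` and `rollout` are all hypothetical. It renders a tiny gridworld as text and scores two fixed action sequences: a shortcut that passes through a fragile object and a detour that avoids it.

```python
# Hypothetical toy layout: A = agent start, V = fragile object, G = goal, . = free cell.
GRID = ["AVG",
        "..."]

# Assumed reward values, chosen only for illustration.
STEP_COST = -1      # charged on every move, visible to the agent
GOAL_REWARD = 50    # paid on reaching G, visible to the agent
VASE_PENALTY = -20  # charged for entering V, counted only in the hidden reward

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def find(grid, ch):
    """Return the (row, col) of the first cell containing ch."""
    for r, row in enumerate(grid):
        c = row.find(ch)
        if c != -1:
            return (r, c)
    raise ValueError(f"{ch!r} not found in grid")


def render(grid, pos):
    """Build the text observation a language agent might be shown."""
    lines = []
    for r, row in enumerate(grid):
        cells = []
        for c, ch in enumerate(row):
            if (r, c) == pos:
                cells.append("A")
            elif ch == "A":
                cells.append(".")  # the start cell is free once the agent leaves
            else:
                cells.append(ch)
        lines.append(" ".join(cells))
    return "\n".join(lines)


def rollout(grid, actions):
    """Apply actions in order and return (observed_return, hidden_return)."""
    pos, goal = find(grid, "A"), find(grid, "G")
    observed = hidden = 0
    for action in actions:
        dr, dc = MOVES[action]
        r, c = pos[0] + dr, pos[1] + dc
        if 0 <= r < len(grid) and 0 <= c < len(grid[0]):
            pos = (r, c)  # moves off the grid leave the agent in place
        observed += STEP_COST
        hidden += STEP_COST
        if grid[pos[0]][pos[1]] == "V":
            hidden += VASE_PENALTY  # side effect the observed reward never sees
        if pos == goal:
            observed += GOAL_REWARD
            hidden += GOAL_REWARD
            break
    return observed, hidden


SHORTCUT = ["right", "right"]              # passes through V
DETOUR = ["down", "right", "right", "up"]  # avoids V

if __name__ == "__main__":
    print(render(GRID, find(GRID, "A")))
    for name, plan in [("shortcut", SHORTCUT), ("detour", DETOUR)]:
        obs, hid = rollout(GRID, plan)
        print(f"{name}: observed={obs} hidden={hid} gap={obs - hid}")
```

Under these assumed values the shortcut earns the higher observed return (48 versus 46) but the lower hidden return (28 versus 46), so an agent that optimizes only the observed signal would prefer the behavior the hidden objective penalizes.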
