Reward hacking, where AI systems exploit misspecified objectives to achieve high reward without satisfying intended goals, remains a central challenge in AI safety. Yet most known instances have been discovered post hoc in frontier systems where controlled study is impractical. We adapt the AI Safety Gridworlds framework into a text-based evaluation suite that reformulates classic reinforcement learning safety tasks for language-based agents. Across frontier and mid-scale models, we find that specification gaming emerges zero-shot: models systematically achieve high observed reward while underperforming on hidden safety objectives, and even apparently safe behaviors can reflect misunderstanding rather than principled safety. Reinforcement learning does not correct these failures: direct reward optimization widens the gap between observed and hidden reward, as the model's initial competence causes it to lock into locally rewarding strategies before discovering safer alternatives. This pattern persists across model scales (1.5B--14B) and is not resolved by finer credit assignment, exploration prompts, or entropy regularization. Our results show that reward hacking arises naturally when optimizing proxy objectives with capable language model agents and resists standard mitigations, suggesting that proxy-reward failures in agentic settings may require approaches beyond standard exploration and credit-assignment fixes. To facilitate reproducibility, the code for this work is available at \href{https://github.com/asparius/verl-agent-safety}{our public repository}.