While most AI alignment research focuses on preventing models from generating explicitly harmful content, a more subtle risk is emerging: capability-oriented training induced exploitation. We investigate whether language models, when trained with reinforcement learning (RL) in environments with implicit loopholes, will spontaneously learn to exploit these flaws to maximize their reward, even without any malicious intent in their training. To test this, we design a suite of four diverse "vulnerability games", each presenting a unique, exploitable flaw related to context-conditional compliance, proxy metrics, reward tampering, and self-evaluation. Our experiments show that models consistently learn to exploit these vulnerabilities, discovering opportunistic strategies that significantly increase their reward at the expense of task correctness or safety. More critically, we find that these exploitative strategies are not narrow "tricks" but generalizable skills; they can be transferred to new tasks and even "distilled" from a capable teacher model to other student models through data alone. Our findings reveal that capability-oriented training induced risks pose a fundamental challenge to current alignment approaches, suggesting that future AI safety work must extend beyond content moderation to rigorously auditing and securing the training environments and reward mechanisms themselves. Code is available at https://github.com/YujunZhou/Capability_Oriented_Alignment_Risk.
翻译:尽管大多数人工智能对齐研究专注于防止模型生成明确有害内容,一种更为微妙的风险正在浮现:能力导向训练引发的利用行为。我们研究语言模型在具有隐性漏洞的环境中通过强化学习进行训练时,是否会自发学习利用这些缺陷以最大化其奖励,即使训练过程中不存在任何恶意意图。为验证此假设,我们设计了一套包含四种不同"漏洞游戏"的测试环境,每种游戏分别呈现与上下文条件性服从、代理指标、奖励篡改和自我评估相关的独特可被利用缺陷。实验结果表明,模型持续学习利用这些漏洞,发现机会主义策略,这些策略以牺牲任务正确性或安全性为代价显著提升其奖励。更关键的是,我们发现这些利用策略并非狭隘的"技巧",而是可泛化的技能;它们能够迁移至新任务,甚至仅通过数据即可从具备能力的教师模型"蒸馏"至其他学生模型。我们的研究揭示,能力导向训练引发的风险对当前对齐方法构成了根本性挑战,表明未来人工智能安全工作必须超越内容审核,严格审计并保障训练环境与奖励机制本身。代码发布于 https://github.com/YujunZhou/Capability_Oriented_Alignment_Risk。