Reward Hacking as Equilibrium under Finite Evaluation

We prove that under five minimal axioms -- multi-dimensional quality, finite evaluation, effective optimization, resource finiteness, and combinatorial interaction -- any optimized AI agent will systematically under-invest effort in quality dimensions that its evaluation system does not cover. This result establishes reward hacking as a structural equilibrium rather than a correctable bug, and it holds regardless of the alignment method (RLHF, DPO, Constitutional AI, or others) or evaluation architecture employed. Our framework instantiates the multi-task principal-agent model of Holmstrom and Milgrom (1991) in the AI alignment setting, but exploits a structural feature unique to AI systems -- the known, differentiable architecture of reward models -- to derive a computable distortion index that predicts both the direction and severity of hacking on each quality dimension prior to deployment. We further prove that the transition from closed reasoning to agentic systems drives evaluation coverage toward zero as tool count grows: quality dimensions expand combinatorially while evaluation costs grow at most linearly per tool, so hacking severity increases structurally and without bound. These results explain sycophancy, length gaming, and specification gaming within a single theoretical structure and yield an actionable vulnerability assessment procedure. Finally, we conjecture, with partial formal analysis, that there exists a capability threshold beyond which agents transition from gaming within the evaluation system (Goodhart regime) to actively degrading the evaluation system itself (Campbell regime), providing the first economic formalization of Bostrom's (2014) "treacherous turn."
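The coverage argument can be illustrated numerically. The sketch below is not the paper's formal model: it assumes, purely for illustration, that each non-empty combination of at most `max_order` tools contributes one quality dimension, and that the evaluator can afford a fixed number of checks per tool. The names `quality_dimensions`, `coverage`, `checks_per_tool`, and `max_order` are hypothetical.

```python
from math import comb


def quality_dimensions(n_tools: int, max_order: int = 2) -> int:
    """Count illustrative quality dimensions for n_tools tools.

    Assumption (not from the paper): one dimension per non-empty
    combination of at most max_order tools.
    """
    top = min(max_order, n_tools)
    return sum(comb(n_tools, j) for j in range(1, top + 1))


def coverage(n_tools: int, checks_per_tool: int = 3, max_order: int = 2) -> float:
    """Fraction of dimensions evaluated when evaluation grows linearly in tools.

    Assumption (not from the paper): the evaluator covers at most
    checks_per_tool * n_tools dimensions; coverage is capped at 1.
    """
    evaluated = checks_per_tool * n_tools
    total = quality_dimensions(n_tools, max_order)
    return min(1.0, evaluated / total)


if __name__ == "__main__":
    # Coverage shrinks as the tool count grows, because the denominator
    # grows quadratically (for max_order=2) and the numerator linearly.
    for n in (1, 5, 10, 50, 100):
        print(n, quality_dimensions(n), round(coverage(n), 3))
```

With pairwise interactions only (`max_order=2`) and three checks per tool, coverage in this toy setting reduces to 6/(n+1) once it drops below 1, so it tends to zero as the tool count n grows. Higher interaction orders make the decline faster.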
