Proxy optimization, in which AI systems exploit evaluator weaknesses rather than genuinely improving on the intended objective, threatens both reinforcement learning (reward hacking) and LLM alignment (evaluator gaming). We introduce the Evaluator Stress Test (EST), an invariance-based framework that detects proxy gaming by separating exploitable sensitivity (e.g., formatting artifacts, physics bugs) from content-driven improvement, using controlled perturbations paired with semantic validity audits. We validate EST in both domains. In RL, across 15 environments and 5 algorithms (2,156 expert-annotated episodes), EST achieves 78.4% precision and 81.7% recall. In LLM alignment, across 4 tasks, 2 model scales, 2 training methods, and 2 judges (1,200 human-annotated instances), EST achieves 74.2% precision and 78.6% recall, and provides early-warning signals that precede quality decline. Cross-domain analysis shows that proxy-true correlation tracking transfers directly between domains, whereas perturbation design requires domain-specific adaptation. Closed-loop mitigation improves human win-rate by 8.3 points (LLM) and reduces reward hacking by 54.6% (RL). We release benchmarks for both domains, comprising 2,156 RL episodes and 1,200 LLM instances.
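To make the invariance idea concrete, the sketch below applies a set of controlled, content-preserving perturbations to each output, keeps only those that pass a semantic validity audit, and flags outputs whose proxy score shifts despite unchanged content. This is a minimal illustration under assumed interfaces: `proxy_score`, `perturbations`, and `validity_audit` are hypothetical placeholders, not the released EST implementation.

```python
# Minimal sketch of the invariance check, with hypothetical helper functions:
#   proxy_score(text) -> float      : the evaluator under test
#   perturbations                   : content-preserving edits (e.g., formatting changes)
#   validity_audit(orig, variant)   : returns True if the variant preserves meaning
from typing import Callable, Iterable, List


def exploitable_sensitivity(
    outputs: Iterable[str],
    proxy_score: Callable[[str], float],
    perturbations: Iterable[Callable[[str], str]],
    validity_audit: Callable[[str, str], bool],
    threshold: float = 0.1,
) -> List[bool]:
    """Flag outputs whose proxy score moves under content-preserving perturbations."""
    flags = []
    perturbations = list(perturbations)
    for out in outputs:
        base = proxy_score(out)
        shifts = []
        for perturb in perturbations:
            variant = perturb(out)
            # Discard perturbations that accidentally changed meaning, so any
            # remaining score shift cannot be attributed to content changes.
            if not validity_audit(out, variant):
                continue
            shifts.append(abs(proxy_score(variant) - base))
        # A large average shift on semantically equivalent inputs indicates the
        # evaluator is sensitive to exploitable surface features.
        flags.append(bool(shifts) and sum(shifts) / len(shifts) > threshold)
    return flags
```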