Signed Compression Progress on a Sealed Audit is Goodhart-Resistant

from arxiv, 16 pages, 7 figures. Lean 4 (Mathlib) mechanized core and ARC-TGI experiment code: https://github.com/Zetetic-Dhruv/audit-compression-progress

Compression progress is a long-standing proposal for intrinsic motivation: reward an agent when its world model becomes better at predicting or compressing experience. The folk claim is that this reward is "credible" because it is paid only for learning. We make this precise and prove it. If intrinsic reward is the signed decrease of a fixed sealed-audit loss, r_t = E(theta_{t-1}) - E(theta_t), then cumulative reward telescopes exactly to endpoint audit improvement, so no policy can push reward up indefinitely while true audit performance stagnates or degrades. For finite audit panels the same result holds with a sharp false-positive budget: cumulative empirical reward is at most true audit improvement plus 2 Delta_n(F, delta), the uniform audit deviation of the model class. This is horizon-free: adaptivity over time costs nothing once the sealed panel uniformly controls the class. The theorem also identifies the failure modes: the guarantee disappears if progress is clipped, scored on the agent's own stream, exposed to a high-capacity model on a reusable panel, or applied to a neural class that makes Delta_n vacuous. We give a Lean 4 mechanization of the structural core (telescoping, the finite-audit bound, finite Gibbs, and the entropy floor) and an experiment suite on ARC-TGI grid-transformation generators with adaptive holdout attacks. Experiments confirm the theory: finite-audit deviation scales as n^{-0.527}; signed progress resists clip-farming, stream leakage, and noisy-TV curiosity; naive reusable audits are exploitable by black-box scalar feedback, while standard release defenses keep the attack below the 2 Delta_n threshold. Signed compression progress on a sealed audit is an accounting signal of genuine improvement.

翻译：压缩进展是内在动机的一个长期提案：当智能体的世界模型在预测或压缩经验方面变得更好时给予奖励。坊间主张这种奖励是“可信的”，因为它仅在学习时支付。我们精确化了这一点并加以证明。如果内在奖励是固定密封审计损失的符号减少，即 r_t = E(θ_{t-1}) - E(θ_t)，那么累积奖励直接等于终点审计改进，因此任何策略都无法在真实审计表现停滞或退化的情况下无限推高奖励。对于有限审计面板，同样的结果成立，且具有尖锐的假阳性预算：累积经验奖励最多等于真实审计改进加上 2 Δ_n(F, δ)，即模型类的均匀审计偏差。这是无时间视界的：一旦密封面板均匀控制了该类，随时间变化的适应性不产生额外成本。该定理还识别了失败模式：如果进展被截断、根据智能体自身流评分、暴露于可重用面板上的高容量模型，或应用于使 Δ_n 无意义的神经类，则保证消失。我们提供了结构核心的 Lean 4 形式化证明（望远镜求和、有限审计界、有限吉布斯和熵下限），以及在 ARC-TGI 网格变换生成器上结合自适应保留攻击的实验套件。实验证实了理论：有限审计偏差按 n^{-0.527} 缩放；符号进展抵抗了截断耕作、流泄漏和噪声电视好奇心；朴素的可重用审计可被黑盒标量反馈利用，而标准发布防御将攻击保持在 2 Δ_n 阈值以下。密封审计上的符号压缩进展是真正改进的会计信号。