Can a safety gate permit unbounded beneficial self-modification while keeping cumulative risk bounded? We formalize this question as a pair of dual conditions, sum delta_n < infinity (bounded risk) and sum TPR_n = infinity (unbounded utility), and develop a theory of when the two are compatible and when they are not. Classification impossibility (Theorem 1): For power-law risk schedules delta_n = O(n^{-p}) with p > 1, any classifier-based gate operating on overlapping safe and unsafe distributions satisfies TPR_n <= C_alpha * delta_n^beta by Hölder's inequality, which forces sum TPR_n < infinity. This impossibility is exponent-optimal (Theorem 3). A second, independent proof via the NP counting method (Theorem 4) yields a 13% tighter bound without using Hölder's inequality. Universal finite-horizon ceiling (Theorem 5): For any summable risk schedule, the exact maximum utility achievable by a classifier is U*(N, B) = N * TPR_NP(B/N), which grows as exp(O(sqrt(log N))), that is, subpolynomially in N. At N = 10^6 with budget B = 1.0, a classifier extracts at most U* ~ 87, versus roughly 500,000 for a verifier. Verification escape (Theorem 2): A Lipschitz ball verifier achieves delta = 0 with TPR > 0, escaping the impossibility, so the separation between classifier-based gates and verifiers is strict. Formal Lipschitz bounds for pre-LayerNorm transformers under LoRA enable verification at LLM scale. We validate the approach on GPT-2 (d_LoRA = 147,456), obtaining conditional delta = 0 with TPR = 0.352. Comprehensive empirical validation is in the companion paper [D2].
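To make the finite-horizon ceiling U*(N, B) = N * TPR_NP(B/N) concrete, the following is a minimal sketch, not the paper's implementation. It assumes that NP stands for Neyman-Pearson and that gate scores follow an equal-variance Gaussian model (unsafe ~ N(0, 1), safe ~ N(d, 1)). The abstract specifies neither the distributions nor a separation parameter, so `d`, `tpr_np`, and `classifier_ceiling` are hypothetical names introduced only for illustration.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for both cdf and inverse cdf


def tpr_np(alpha: float, d: float) -> float:
    """Neyman-Pearson TPR at false-acceptance rate alpha.

    ASSUMPTION (not from the source): unsafe scores ~ N(0, 1) and
    safe scores ~ N(d, 1). The optimal gate accepts scores above a
    threshold, which gives TPR = Phi(Phi^{-1}(alpha) + d).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return _STD.cdf(_STD.inv_cdf(alpha) + d)


def classifier_ceiling(n_steps: int, budget: float, d: float) -> float:
    """U*(N, B) = N * TPR_NP(B / N): spend the risk budget evenly over N steps."""
    return n_steps * tpr_np(budget / n_steps, d)


if __name__ == "__main__":
    # Hypothetical separation d = 1.0; budget B = 1.0 as in the abstract.
    for n in (10**3, 10**6, 10**9):
        print(f"N = {n:>10}  U* = {classifier_ceiling(n, 1.0, 1.0):.1f}")
```

Under this assumed model with d = 1.0, the ceiling at N = 10^6 and B = 1.0 comes out in the high 80s, the same order as the figure quoted above. It also grows far more slowly than any fixed power of N, which is the subpolynomial behavior the abstract describes. Other distributional assumptions would give different constants.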