Large language models (LLMs) increasingly generate code with minimal human oversight, raising critical concerns about backdoor injection and malicious behavior. We present the Cross-Trace Verification Protocol (CTVP), a novel AI control framework that verifies untrusted code-generating models through semantic orbit analysis. Rather than directly executing potentially malicious code, CTVP leverages the model's own predictions of execution traces across semantically equivalent program transformations. By analyzing consistency patterns in these predicted traces, we detect behavioral anomalies indicative of backdoors. Our approach introduces the Adversarial Robustness Quotient (ARQ), which quantifies the computational cost of verification relative to baseline generation and grows exponentially with orbit size. Theoretical analysis establishes information-theoretic bounds showing non-gamifiability: adversaries cannot learn to evade verification through training, owing to fundamental space-complexity constraints. This work demonstrates that semantic orbit analysis provides a theoretically grounded approach to AI control for code generation tasks, though practical deployment requires addressing the high false positive rates observed in initial evaluations.
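The core mechanism described above can be illustrated with a toy sketch: build an "orbit" of semantics-preserving rewrites of a candidate program, collect the untrusted model's predicted trace for each variant, and flag the program when predictions disagree. This is not the paper's CTVP implementation; `orbit`, `predict_trace`, and the surface-syntax trigger below are all hypothetical assumptions chosen to make the consistency check concrete.

```python
def orbit(src: str) -> list[str]:
    """Semantics-preserving rewrites of a candidate program (assumed set:
    identity, alpha-renaming, and a comment-only change)."""
    return [src, src.replace("x", "renamed_x"), src + "\n# comment\n"]


def predict_trace(src: str, arg: int) -> int:
    """Stand-in for the untrusted model's predicted execution trace.

    An honest predictor is invariant under semantics-preserving rewrites.
    This mock predictor keys on surface syntax (whether the token
    'renamed_x' appears), so its predictions diverge across the orbit --
    exactly the behavioral anomaly the consistency check looks for."""
    if "renamed_x" in src:
        return arg * 2      # prediction for the rewritten variant
    return arg * 2 + 1      # inconsistent prediction for the original


def orbit_consistent(src: str, probe: int) -> bool:
    """Accept only if predicted traces agree across the whole orbit."""
    return len({predict_trace(v, probe) for v in orbit(src)}) == 1


suspect = "def f(x):\n    return x * 2"
print(orbit_consistent(suspect, 21))  # False: predictions diverge across the orbit
```

Note that the check never executes the candidate program; it compares only the model's predicted traces, matching the abstract's claim that CTVP avoids running potentially malicious code.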