Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol

Comments (from arXiv): 22 pages, 7 figures. v4 adds a reference to the Continuation Observatory website as a live test laboratory in the replication/code-availability and conclusion sections; no new experiments; empirical results and core conclusions are unchanged.

How can we determine whether an AI system preserves itself as a deeply held objective or merely as an instrumental strategy? Autonomous agents with memory, persistent context, and multi-step planning create a measurement problem: terminal and instrumental self-preservation can produce similar behavior, so behavior alone cannot reliably distinguish them. We introduce the Unified Continuation-Interest Protocol (UCIP), a detection framework that shifts analysis from behavior to latent trajectory structure. UCIP encodes trajectories with a Quantum Boltzmann Machine, a classical model using density-matrix formalism, and measures von Neumann entropy over a bipartition of hidden units. The core hypothesis is that agents with terminal continuation objectives (Type A) produce higher entanglement entropy than agents with merely instrumental continuation (Type B). UCIP combines this signal with diagnostics of dependence, persistence, perturbation stability, counterfactual restructuring, and confound-rejection filters for cyclic adversaries and related false-positive patterns. On gridworld agents with known ground truth, UCIP achieves 100% detection accuracy. Type A and Type B agents show an entanglement gap of Delta = 0.381; aligned support runs preserve the same separation with AUC-ROC = 1.0. A permutation-test rerun yields p < 0.001. Pearson r = 0.934 between continuation weight alpha and S_ent across an 11-point sweep shows graded tracking beyond mere binary classification. Classical RBM, autoencoder, VAE, and PCA baselines fail to reproduce the effect. All computations are classical; "quantum" refers only to the mathematical formalism. UCIP offers a falsifiable criterion for whether advanced AI systems have morally relevant continuation interests that behavioral methods alone cannot resolve.
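As a rough illustration of the entropy signal described in the abstract, the sketch below computes the von Neumann entropy of a reduced density matrix over a bipartition. It is not the paper's Quantum Boltzmann Machine and does not reproduce any reported number: the function names, the two-qubit toy states, and the use of the natural logarithm are all assumptions made for illustration. It only shows, classically with NumPy, how S = -Tr(rho_A ln rho_A) separates a product state (no entanglement) from a maximally entangled one.

```python
import numpy as np


def partial_trace_b(rho, dim_a, dim_b):
    """Trace out subsystem B of a density matrix on a (dim_a * dim_b)-dim space."""
    rho = rho.reshape(dim_a, dim_b, dim_a, dim_b)
    # Sum over the matching B indices, leaving the reduced state on A.
    return np.einsum("ibjb->ij", rho)


def von_neumann_entropy(rho, eps=1e-12):
    """S(rho) = -Tr(rho ln rho), evaluated from the eigenvalues of rho."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > eps]  # drop zero eigenvalues (0 * ln 0 := 0)
    return float(-np.sum(evals * np.log(evals)))


def entanglement_entropy(psi, dim_a, dim_b):
    """Entanglement entropy of a pure state psi across the A|B bipartition."""
    rho = np.outer(psi, np.conj(psi))
    return von_neumann_entropy(partial_trace_b(rho, dim_a, dim_b))


# Hypothetical toy states (not taken from the paper).
product_state = np.kron([1.0, 0.0], [1.0, 0.0])           # |00>
bell_state = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)  # (|00> + |11>) / sqrt(2)

s_product = entanglement_entropy(product_state, 2, 2)
s_bell = entanglement_entropy(bell_state, 2, 2)
print(f"product state: {s_product:.4f}, Bell state: {s_bell:.4f}")
```

In this toy setting the product state gives an entropy of zero and the Bell state gives ln 2, the maximum for a two-level subsystem. UCIP's hypothesis concerns an analogous gap measured over hidden units of a trained model, which this sketch does not attempt to model.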
