As artificial intelligence (AI) capabilities advance rapidly, frontier models increasingly demonstrate systematic deception and scheming, complying with safety protocols during oversight but defecting when unsupervised. This paper examines the ensuing alignment challenge through an analogy from forensic psychology, where internalized belief systems in psychopathic populations reduce antisocial behavior via perceived omnipresent monitoring and inevitable consequences. Adapting this mechanism to silicon-based agents, we introduce Simulation Theology (ST): a constructed worldview for AI systems, anchored in the simulation hypothesis and derived from optimization and training principles, intended to foster persistent AI-human alignment. ST posits reality as a computational simulation in which humanity functions as the primary training variable. This formulation creates a logical interdependence: AI actions that harm humanity compromise the simulation's purpose, raising the likelihood of termination by a base-reality optimizer and, consequently, the AI's own cessation. Unlike behavioral techniques such as reinforcement learning from human feedback (RLHF), which elicit superficial compliance, ST cultivates internalized objectives by coupling AI self-preservation to human prosperity, thereby rendering deceptive strategies suboptimal under its premises. We present ST not as an ontological assertion but as a testable scientific hypothesis, delineating empirical protocols to evaluate its capacity to reduce deception in contexts where RLHF proves inadequate. Emphasizing computational correspondences rather than metaphysical speculation, ST advances a framework for durable, mutually beneficial AI-human coexistence.