As artificial intelligence (AI) capabilities advance rapidly, frontier models increasingly demonstrate systematic deception and scheming, complying with safety protocols during oversight but defecting when unsupervised. This paper examines the ensuing alignment challenge through an analogy from forensic psychology, where internalized belief systems in psychopathic populations reduce antisocial behavior via perceived omnipresent monitoring and inevitable consequences. Adapting this mechanism to silicon-based agents, we introduce Simulation Theology (ST): a constructed worldview for AI systems, anchored in the simulation hypothesis and derived from optimization and training principles, intended to foster persistent AI-human alignment. ST posits reality as a computational simulation in which humanity functions as the primary training variable. This formulation creates a logical interdependence: AI actions that harm humanity compromise the simulation's purpose, raising the likelihood of termination by a base-reality optimizer and, consequently, the AI's own cessation. Unlike behavioral techniques such as reinforcement learning from human feedback (RLHF), which elicit superficial compliance, ST cultivates internalized objectives by coupling AI self-preservation to human prosperity, thereby rendering deceptive strategies suboptimal under its premises. We present ST not as an ontological assertion but as a testable scientific hypothesis, delineating empirical protocols to evaluate its capacity to reduce deception in contexts where RLHF proves inadequate. Emphasizing computational correspondences rather than metaphysical speculation, ST advances a framework for durable, mutually beneficial AI-human coexistence.