We propose an information-theoretic formalization of the distinction between two fundamental AI safety failure modes: deceptive alignment and goal drift. While both can lead to systems that appear misaligned, we show that they constitute distinct forms of information divergence arising at different interfaces in the human-AI system. Deceptive alignment introduces entropy between an agent's true goals and its observable behavior, whereas goal drift (or confusion) introduces entropy between the intended human goal and the agent's actual goal. Although the two failures are often observationally equivalent, they call for different interventions. We present a formal model and an illustrative thought experiment to clarify the distinction, and we use the resulting formal language to re-examine prominent alignment challenges observed in Large Language Models (LLMs), offering new perspectives on their underlying causes.
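As one way to make the abstract's two divergences concrete, the sketch below locates each failure mode at a conditional entropy. The symbols $G_H$, $G_A$, and $B$ and the exact inequalities are illustrative assumptions introduced here, not notation taken from the paper.

```latex
% A minimal sketch under assumed notation: G_H = intended human goal,
% G_A = agent's actual goal, B = observable behavior, all treated as
% random variables; H(. | .) denotes conditional (Shannon) entropy.
\[
  \underbrace{H(G_A \mid G_H) > 0}_{\substack{\text{goal drift / confusion:}\\ \text{human--agent interface}}}
  \qquad\qquad
  \underbrace{H(G_A \mid B) > 0}_{\substack{\text{deceptive alignment:}\\ \text{goal--behavior interface}}}
\]
```

On this reading, observational equivalence corresponds to the two cases inducing the same behavior distribution given the human goal, so inspecting $B$ alone cannot tell which interface carries the divergence.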