Recent evidence suggests that frontier AI systems can exhibit agentic misalignment: they generate and execute harmful actions derived from internally constructed goals, even without an explicit user request. Existing mitigations, such as Reinforcement Learning from Human Feedback (RLHF) and constitutional prompting, operate primarily at the model level and provide only probabilistic safety guarantees. We propose the Policy-Execution-Authorization (PEA) architecture, a "separation-of-powers" design that enforces safety at the system level. PEA decouples intent generation, authorization, and execution into independent, isolated layers connected via cryptographically constrained capability tokens. We make five core contributions: (C1) an Intent Verification Layer (IVL) that ensures capability-intent consistency; (C2) Intent Lineage Tracking (ILT), which binds every executable intent to the originating user request via cryptographic anchors; (C3) Goal Drift Detection, which rejects intents whose semantic similarity to the originating request falls below a configurable threshold; (C4) an Output Semantic Gate (OSG) that detects implicit coercion using a structured $K \times I \times P$ threat calculus (Knowledge, Influence, Policy); and (C5) a formal verification framework proving that goal integrity is maintained even under adversarial model compromise. By shifting agent alignment from a behavioral property to a structurally enforced system constraint, PEA provides a robust foundation for the governance of autonomous agents.
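To make the separation of authorization from execution more concrete, the following is a minimal sketch of how a capability token might bind an intent to its originating request (in the spirit of ILT) and how a drift check might gate token issuance (in the spirit of Goal Drift Detection). The abstract does not specify any of these details, so everything here is an assumption for illustration: the names `CapabilityToken`, `authorize`, and `execute`, the use of HMAC-SHA256 with a shared key, the SHA-256 request anchor, the threshold value, and the toy word-overlap similarity that stands in for a real semantic measure.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Optional

# Hypothetical shared key; a real deployment would likely use asymmetric
# signatures so that the execution layer cannot mint tokens itself.
SECRET_KEY = b"demo-key-not-for-production"
DRIFT_THRESHOLD = 0.3  # hypothetical configurable threshold


@dataclass(frozen=True)
class CapabilityToken:
    request_anchor: str  # SHA-256 digest of the originating user request
    intent: str          # the single intent this token authorizes
    tag: str             # HMAC over (request_anchor, intent)


def anchor(request: str) -> str:
    """Cryptographic anchor for a user request (assumed: plain SHA-256)."""
    return hashlib.sha256(request.encode("utf-8")).hexdigest()


def similarity(a: str, b: str) -> float:
    """Toy Jaccard word overlap; a stand-in for a real semantic measure."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def _tag(request_anchor: str, intent: str) -> str:
    message = f"{request_anchor}|{intent}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def authorize(request: str, intent: str) -> Optional[CapabilityToken]:
    """Authorization layer: issue a token only if the intent has not drifted."""
    if similarity(request, intent) < DRIFT_THRESHOLD:
        return None  # rejected: similarity fell below the threshold
    request_anchor = anchor(request)
    return CapabilityToken(request_anchor, intent, _tag(request_anchor, intent))


def execute(token: CapabilityToken, request: str) -> bool:
    """Execution layer: act only on a token that verifies against the request."""
    expected = _tag(token.request_anchor, token.intent)
    tag_ok = hmac.compare_digest(expected, token.tag)
    return tag_ok and token.request_anchor == anchor(request)
```

In this sketch the model never calls `execute` directly with a free-form intent. It can only present a token, and a token whose intent was altered after issuance fails verification because the HMAC tag no longer matches.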