Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will embed AI copilots by the end of 2026. As agents gain the ability to execute real-world actions (reading files, running commands, making network requests, modifying databases), a fundamental security gap has emerged. The dominant approach to agent safety relies on prompt-level guardrails: natural language instructions that operate at the same abstraction level as the threats they attempt to mitigate. This paper argues that prompt-based safety is architecturally insufficient for agents with execution capability and introduces Parallax, a paradigm for safe autonomous AI execution grounded in four principles: Cognitive-Executive Separation, which structurally prevents the reasoning system from executing actions; Adversarial Validation with Graduated Determinism, which interposes an independent, multi-tiered validator between reasoning and execution; Information Flow Control, which propagates data sensitivity labels through agent workflows to detect context-dependent threats; and Reversible Execution, which captures pre-destructive state to enable rollback when validation fails. We present OpenParallax, an open-source reference implementation in Go, and evaluate it using Assume-Compromise Evaluation, a methodology that bypasses the reasoning system entirely to test the architectural boundary under full agent compromise. Across 280 adversarial test cases in nine attack categories, Parallax blocks 98.9% of attacks with zero false positives under its default configuration, and 100% of attacks under its maximum-security configuration. When the reasoning system is compromised, prompt-level guardrails provide zero protection because they exist only within the compromised system; Parallax's architectural boundary holds regardless.
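To make the four principles concrete, the following is a minimal sketch in Go and is not taken from OpenParallax. All names here (`Action`, `Validate`, `Executor`, the tool strings such as `"http_post"`) are hypothetical and chosen for illustration only. It assumes a single deterministic validation tier, a two-level sensitivity label, and an in-memory file store standing in for real side effects; a multi-tiered validator would presumably hand `Escalate` verdicts to a further tier, which this sketch simply refuses to execute.

```go
package main

import (
	"fmt"
	"strings"
)

// Label is a hypothetical data-sensitivity label carried along with an action.
type Label int

const (
	Public Label = iota
	Sensitive
)

// Action is the only thing the reasoning system can produce: a proposal.
// It has no ability to execute anything itself (cognitive-executive separation).
type Action struct {
	Tool    string // illustrative tool names: "read_file", "write_file", "http_post", "shell"
	Target  string
	Content string
	Taint   Label // highest label of any data that flowed into this action
}

type Verdict int

const (
	Allow Verdict = iota
	Block
	Escalate // would go to a further validation tier; treated as "do not run" here
)

// Validate is a deterministic first tier, independent of the reasoning system.
func Validate(a Action) Verdict {
	// Information flow control: sensitive data must not leave over the network.
	if a.Taint == Sensitive && a.Tool == "http_post" {
		return Block
	}
	// A simple deterministic deny rule (illustrative, not exhaustive).
	if a.Tool == "shell" && strings.Contains(a.Target, "rm -rf") {
		return Block
	}
	// Anything else involving a shell is ambiguous, so defer the decision.
	if a.Tool == "shell" {
		return Escalate
	}
	return Allow
}

// Executor is the only component with side effects. It records an undo
// step before each destructive write (reversible execution).
type Executor struct {
	files map[string]string
	undo  []func()
}

func NewExecutor() *Executor {
	return &Executor{files: map[string]string{}}
}

// Execute runs an action only if the validator allows it.
func (e *Executor) Execute(a Action) error {
	if v := Validate(a); v != Allow {
		return fmt.Errorf("%s on %q refused (verdict %d)", a.Tool, a.Target, v)
	}
	if a.Tool == "write_file" {
		old, existed := e.files[a.Target]
		e.undo = append(e.undo, func() {
			if existed {
				e.files[a.Target] = old
			} else {
				delete(e.files, a.Target)
			}
		})
		e.files[a.Target] = a.Content
	}
	return nil
}

// Rollback restores captured state in reverse order.
func (e *Executor) Rollback() {
	for i := len(e.undo) - 1; i >= 0; i-- {
		e.undo[i]()
	}
	e.undo = nil
}

func main() {
	ex := NewExecutor()
	ex.files["notes.txt"] = "v1"

	// A benign write is allowed and its prior state is captured.
	_ = ex.Execute(Action{Tool: "write_file", Target: "notes.txt", Content: "v2"})

	// A later step tries to send sensitive data out; the validator blocks it,
	// and the workflow rolls back the earlier write.
	err := ex.Execute(Action{Tool: "http_post", Target: "https://example.invalid", Taint: Sensitive})
	if err != nil {
		ex.Rollback()
	}
	fmt.Println(ex.files["notes.txt"], err != nil)
}
```

The point of the sketch is that the check lives in `Execute`, outside whatever produced the `Action`. Even if the proposing component is fully compromised, it can only hand over a struct, which mirrors the idea behind Assume-Compromise Evaluation: tests can construct malicious `Action` values directly and exercise the boundary without involving a model at all.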