Parallax: Why AI Agents That Think Must Never Act

From arXiv. 20 pages, 1 figure, 5 tables. Open-source reference implementation: https://github.com/openparallax/openparallax. Documentation: https://docs.openparallax.dev. Feedback welcome via email or GitHub issues.

Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will embed AI copilots by the end of 2026. As agents gain the ability to execute real-world actions (reading files, running commands, making network requests, modifying databases), a fundamental security gap has emerged. The dominant approach to agent safety relies on prompt-level guardrails: natural language instructions that operate at the same abstraction level as the threats they attempt to mitigate. This paper argues that prompt-based safety is architecturally insufficient for agents with execution capability and introduces Parallax, a paradigm for safe autonomous AI execution grounded in four principles: Cognitive-Executive Separation, which structurally prevents the reasoning system from executing actions; Adversarial Validation with Graduated Determinism, which interposes an independent, multi-tiered validator between reasoning and execution; Information Flow Control, which propagates data sensitivity labels through agent workflows to detect context-dependent threats; and Reversible Execution, which captures pre-destructive state to enable rollback when validation fails. We present OpenParallax, an open-source reference implementation in Go, and evaluate it using Assume-Compromise Evaluation, a methodology that bypasses the reasoning system entirely to test the architectural boundary under full agent compromise. Across 280 adversarial test cases in nine attack categories, Parallax blocks 98.9% of attacks with zero false positives under its default configuration, and 100% of attacks under its maximum-security configuration. When the reasoning system is compromised, prompt-level guardrails provide zero protection because they exist only within the compromised system; Parallax's architectural boundary holds regardless.
