Recent progress in (Large) Language Models (LMs) has enabled autonomous LM-based agents capable of executing complex tasks with minimal supervision. These agents are being integrated into systems with significant autonomy and authority, and the security community has begun to study the resulting risks. One emerging direction for mitigating these risks is to constrain agent behaviours via access control and permissioning mechanisms. Existing permissioning proposals, however, remain difficult to compare due to the absence of a shared formal foundation. This work provides such a foundation. We first systematize the landscape by constructing an attack taxonomy tailored to language models, the computational primitives of agentic systems. We then develop a formal treatment of agentic access control by defining an AIOracle algorithmically and introducing a security-game framework that captures both completeness (correct behaviour in the absence of an adversary) and adversarial robustness. Our security game unifies confidentiality, integrity, and availability within a single model. Using this framework, we show that existing approaches to the confidentiality of training data fundamentally conflict with completeness. Finally, we formalize a modular decomposition of helpfulness and harmlessness objectives and prove its soundness, enabling principled reasoning about the security of agentic system designs. Our results suggest that designing a system with measurable security calls for a modular approach: decompose the problem into sub-problems and let the composition of the resulting modules complete the design. Our results further show that this natural approach, paired with the relevant formalism, is what makes security reductions provable.
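To make the security-game framing concrete, the following is a minimal sketch of how completeness and adversarial robustness might be stated side by side; the oracle $\mathcal{O}_\pi$, policy $\pi$, reference behaviour $f$, adversary $\mathcal{A}$, and the advantage notation are illustrative placeholders of ours, not the paper's actual definitions.

% Hedged sketch only: all symbols below are illustrative placeholders,
% not the paper's notation.
% Completeness: absent an adversary, the permissioned oracle O_pi
% matches the intended behaviour f on benign queries q.
\[
  \Pr\bigl[\mathcal{O}_\pi(q) = f(q)\bigr] \;\ge\; 1 - \varepsilon
  \qquad \text{for all benign queries } q .
\]
% Adversarial robustness: no efficient adversary A interacting with the
% oracle induces a policy-violating action, except with negligible
% probability in the security parameter lambda.
\[
  \mathsf{Adv}_{\mathcal{A}}(\lambda)
  \;=\;
  \Pr\bigl[\mathcal{A}^{\mathcal{O}_\pi}(1^\lambda)
    \text{ induces an action } a \text{ with } a \notin \pi\bigr]
  \;\le\; \mathsf{negl}(\lambda) .
\]

Stating both conditions over the same oracle is what would allow confidentiality, integrity, and availability goals to appear as different winning conditions of one game, and it also exposes the tension the abstract notes: constraints imposed for confidentiality of training data can only lower the achievable completeness bound.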