AI agents that interact with the real world through tool calls pose fundamental safety challenges: agents might leak private information, cause unintended side effects, or be manipulated through prompt injection. To address these challenges, we propose to put the agent in a programming-language-based "safety harness": instead of calling tools directly, agents express their intentions as code in a capability-safe language: Scala 3 with capture checking. Capabilities are program variables that regulate access to effects and resources of interest. Scala's type system tracks capabilities statically, providing fine-grained control over what an agent can do. In particular, it enables local purity, the ability to enforce that sub-computations are side-effect-free, preventing information leakage when agents process classified data. We demonstrate that extensible agent safety harnesses can be built by leveraging a strong type system with tracked capabilities. Our experiments show that agents can generate capability-safe code with no significant loss in task performance, while the type system reliably prevents unsafe behaviors such as information leakage and malicious side effects.
翻译:通过工具调用与现实世界交互的AI智能体带来了根本性的安全挑战:智能体可能泄露私人信息、引发意外副作用,或通过提示注入被恶意操纵。为应对这些挑战,我们提出将智能体置于基于编程语言的“安全约束装置”中:智能体不再直接调用工具,而是将其意图表达为采用能力安全语言——即支持捕获检查的Scala 3——编写的代码。能力作为程序变量,用于规范对目标效果和资源的访问。Scala的类型系统可静态追踪能力,从而实现对智能体行为的细粒度控制。该系统特别支持局部纯化功能,即能够强制子计算过程无副作用,防止智能体在处理涉密数据时发生信息泄露。我们通过利用具备能力追踪功能的强类型系统,论证了可扩展智能体安全约束装置的构建可行性。实验表明,智能体能够生成符合能力安全规范的代码,且任务性能未出现显著下降,同时类型系统能可靠地阻止信息泄露和恶意副作用等不安全行为。