Indirect prompt injection attacks threaten AI agents that execute consequential actions, motivating deterministic system-level defenses. Such defenses can provably block unsafe actions by enforcing confidentiality and integrity policies, but currently appear costly: they reduce task completion rates and increase token usage compared to probabilistic defenses. We argue that existing evaluations miss a key benefit of system-level defenses: reduced reliance on human oversight. We introduce autonomy metrics to quantify this benefit: the fraction of consequential actions an agent can execute without human-in-the-loop (HITL) approval while preserving security. To increase autonomy, we design a security-aware agent that (i) introduces richer HITL interactions, and (ii) explicitly plans for both task progress and policy compliance. We implement this agent design atop an existing information-flow control defense against prompt injection and evaluate it on the AgentDojo and WASP benchmarks. Experiments show that this approach yields higher autonomy without sacrificing utility.