Tool-using large language model (LLM) agents face two distinct security failures: unauthorized external actions and exposure of sensitive plaintext inside the runtime before any final output check can intervene. Existing defenses usually protect one boundary, either the planner/runtime or the action sink, and therefore do not by themselves secure both surfaces. We present SecureClaw, a dual-boundary architecture that places authorization at the effect sink and plaintext confinement at the read boundary. Sensitive reads pass through a trusted gateway that replaces raw values with opaque handles and, in the evaluated deployment, bounded summaries as an explicit declassification interface. Writes that change external state follow a PREVIEW$\rightarrow$COMMIT protocol in which only a trusted executor may commit the exact canonical request authorized by policy. The runtime can still plan over summaries and symbolic references, but cannot directly dereference secrets or perform side effects. Across AgentDojo, AgentLeak, and Agent Security Bench (ASB), SecureClaw is the only defense we evaluate in a common harness that simultaneously retains usable task utility and achieves 0\% attack success rate (ASR) on ASB, 0.64\% ASR on AgentDojo, and 3.23\% overall leak on AgentLeak's attacked parity lane, which measures final-output and internal-relay leakage.
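The two boundaries described above can be illustrated with a minimal sketch. It is not SecureClaw's implementation: the class names `Gateway` and `Executor`, the `canonical` helper, the `h_` handle prefix, the summary format, and the policy callback are all hypothetical choices made for illustration. The sketch assumes that a canonical request can be modeled as key-sorted JSON and that authorization can be bound to its SHA-256 digest.

```python
import hashlib
import json
import secrets


class Gateway:
    """Trusted read boundary (hypothetical): keeps plaintext, returns opaque handles."""

    def __init__(self, summary_limit=32):
        self._store = {}
        self._summary_limit = summary_limit

    def read(self, value):
        # The runtime receives only a random handle and a bounded summary.
        handle = "h_" + secrets.token_hex(8)
        self._store[handle] = value
        # Assumed declassification rule: expose type and length only.
        summary = f"{type(value).__name__}, {len(value)} chars"[: self._summary_limit]
        return handle, summary

    def resolve(self, value):
        # Swap a known handle for its plaintext; pass everything else through.
        if isinstance(value, str) and value in self._store:
            return self._store[value]
        return value


def canonical(request):
    """Assumed canonical form: JSON with sorted keys and no extra whitespace."""
    return json.dumps(request, sort_keys=True, separators=(",", ":"))


def _digest(request):
    return hashlib.sha256(canonical(request).encode("utf-8")).hexdigest()


class Executor:
    """Trusted effect sink (hypothetical): the only component that commits writes."""

    def __init__(self, gateway, policy):
        self._gateway = gateway
        self._policy = policy      # callable: request dict -> bool
        self._pending = {}         # single-use token -> digest of authorized request
        self.sent = []             # stand-in for the external side effect

    def preview(self, request):
        # PREVIEW: check policy and bind a token to the exact canonical request.
        if not self._policy(request):
            return None
        token = secrets.token_hex(8)
        self._pending[token] = _digest(request)
        return token

    def commit(self, token, request):
        # COMMIT: refuse anything that differs from what was authorized.
        if self._pending.pop(token, None) != _digest(request):
            raise PermissionError("request does not match the authorized preview")
        # Handles are dereferenced only here, inside the trusted executor.
        resolved = {k: self._gateway.resolve(v) for k, v in request.items()}
        self.sent.append(resolved)
        return True
```

In this sketch the untrusted runtime would hold only the handle and summary returned by `Gateway.read`, build a request that refers to the handle symbolically, and call `preview` then `commit`. A request altered after the preview fails the digest check, and because the token is consumed on every commit attempt, a fresh preview is needed afterwards.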