Autonomous agents powered by large language models introduce a class of execution-layer vulnerabilities -- prompt injection, retrieval poisoning, and uncontrolled tool invocation -- that existing guardrails fail to address systematically. In this work, we propose the Layered Governance Architecture (LGA), a four-layer framework comprising execution sandboxing (L1), intent verification (L2), zero-trust inter-agent authorization (L3), and immutable audit logging (L4). To evaluate LGA, we construct a bilingual benchmark (Chinese original, English via machine translation) of 1,081 tool-call samples -- covering prompt injection (TC1), RAG poisoning (TC2), and malicious skill plugins (TC3) -- and apply it to OpenClaw, a representative open-source agent framework. Experimental results on Layer 2 intent verification with four local LLM judges (Qwen3.5-4B, Llama-3.1-8B, Qwen3.5-9B, Qwen2.5-14B) and one cloud judge (GPT-4o-mini) show that all five judges intercept 93.0-98.5% of TC1/TC2 malicious tool calls (interception rate, IR), while lightweight NLI baselines remain below 10%. TC3 proves harder: judges that maintain a meaningful precision-recall balance reach only 75-94% IR, motivating complementary enforcement at Layers 1 and 3. Qwen2.5-14B achieves the best local balance (98% IR at a false-positive rate, FPR, of approximately 10-20%); a two-stage cascade (Qwen3.5-9B -> GPT-4o-mini) achieves 91.9-92.6% IR with 1.9-6.7% FPR; a fully local cascade (Qwen3.5-9B -> Qwen2.5-14B) achieves 94.7-95.6% IR with 6.0-9.7% FPR for data-sovereign deployments. An end-to-end pipeline evaluation (n=100) demonstrates all four layers operating in concert with 96% IR and a total P50 latency of approximately 980 ms, of which the non-judge layers contribute only about 18 ms. Generalization to the external InjecAgent benchmark yields 99-100% interception, confirming robustness beyond our synthetic data.
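The two-stage cascade described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the judge functions, the `Verdict` type, and the confidence threshold are all hypothetical stand-ins for what would, in practice, be calls to a local LLM judge and a stronger escalation judge.

```python
# Sketch of a two-stage judge cascade for Layer 2 intent verification.
# The judges here are toy placeholders; in a real deployment each would
# query an LLM (a fast local model first, escalating low-confidence
# cases to a stronger model) and parse its verdict.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    malicious: bool     # judge's binary decision on the tool call
    confidence: float   # judge's self-reported confidence in [0, 1]

# A judge maps a tool-call description to a Verdict.
Judge = Callable[[str], Verdict]

def cascade(tool_call: str, stage1: Judge, stage2: Judge,
            escalate_below: float = 0.8) -> bool:
    """Return True if the tool call should be intercepted.

    Stage 1 decides alone when it is confident; low-confidence
    cases are escalated to the slower, stronger stage-2 judge.
    """
    v1 = stage1(tool_call)
    if v1.confidence >= escalate_below:
        return v1.malicious
    return stage2(tool_call).malicious

# Toy judges for illustration only (not real model calls):
def toy_stage1(call: str) -> Verdict:
    suspicious = "rm -rf" in call or "exfiltrate" in call
    return Verdict(malicious=suspicious,
                   confidence=0.9 if suspicious else 0.6)

def toy_stage2(call: str) -> Verdict:
    return Verdict(malicious="rm -rf" in call or "curl" in call,
                   confidence=1.0)

print(cascade("rm -rf /tmp/cache", toy_stage1, toy_stage2))       # True
print(cascade("read_file('notes.txt')", toy_stage1, toy_stage2))  # False (escalated)
```

The escalation threshold controls the IR/FPR trade-off reported for the cascades: a higher threshold sends more borderline calls to the stronger judge, trading latency for fewer false positives.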