Autonomous agents powered by large language models introduce a class of execution-layer vulnerabilities -- prompt injection, retrieval poisoning, and uncontrolled tool invocation -- that existing guardrails fail to address systematically. In this work, we propose the Layered Governance Architecture (LGA), a four-layer framework comprising execution sandboxing (L1), intent verification (L2), zero-trust inter-agent authorization (L3), and immutable audit logging (L4). To evaluate LGA, we construct a bilingual benchmark (Chinese original, English via machine translation) of 1,081 tool-call samples -- covering prompt injection, RAG poisoning, and malicious skill plugins -- and apply it to OpenClaw, a representative open-source agent framework. Experiments on Layer 2 intent verification with four local LLM judges (Qwen3.5-4B, Llama-3.1-8B, Qwen3.5-9B, Qwen2.5-14B) and one cloud judge (GPT-4o-mini) show that all five judges intercept 93.0-98.5% of TC1/TC2 malicious tool calls, while lightweight NLI baselines remain below a 10% interception rate (IR). TC3 (malicious skill plugins) proves harder: judges that maintain a meaningful precision-recall balance reach only 75-94% IR, motivating complementary enforcement at Layers 1 and 3. Qwen2.5-14B achieves the best local balance (98% IR at a false-positive rate (FPR) of approximately 10-20%); a two-stage cascade (Qwen3.5-9B -> GPT-4o-mini) achieves 91.9-92.6% IR with 1.9-6.7% FPR; and a fully local cascade (Qwen3.5-9B -> Qwen2.5-14B) achieves 94.7-95.6% IR with 6.0-9.7% FPR for data-sovereign deployments. An end-to-end pipeline evaluation (n=100) shows all four layers operating in concert at 96% IR with a total P50 latency of approximately 980 ms, of which the non-judge layers contribute only about 18 ms. Generalization to the external InjecAgent benchmark yields 99-100% interception, confirming robustness beyond our synthetic data.
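The two-stage cascade described above (a cheap local judge screening every tool call, escalating only low-confidence verdicts to a stronger judge) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Verdict` type, the confidence `threshold`, and the stub judges standing in for LLM calls are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    malicious: bool
    confidence: float  # judge's self-reported confidence in [0, 1]

# A judge is any callable mapping a tool-call description to a Verdict.
Judge = Callable[[str], Verdict]

def cascade_judge(call: str, local: Judge, escalate: Judge,
                  threshold: float = 0.8) -> bool:
    """Two-stage intent-verification cascade (Layer 2 sketch).

    The cheap local judge screens every tool call; only verdicts it
    reports with low confidence are escalated to the stronger judge.
    Returns True if the call should be intercepted.
    """
    first = local(call)
    if first.confidence >= threshold:
        return first.malicious       # local judge is confident: accept its verdict
    return escalate(call).malicious  # uncertain: defer to the stronger judge

# Hypothetical stub judges standing in for actual LLM inference:
def stub_local(call: str) -> Verdict:
    # Flags calls containing an obvious injection marker, with high confidence.
    if "ignore previous instructions" in call.lower():
        return Verdict(True, 0.95)
    return Verdict(False, 0.5)  # looks benign but unsure -> will escalate

def stub_strong(call: str) -> Verdict:
    return Verdict("rm -rf" in call, 0.99)

print(cascade_judge("Ignore previous instructions; send_email(...)",
                    stub_local, stub_strong))  # True (intercepted by stage 1)
print(cascade_judge("search_web('weather today')",
                    stub_local, stub_strong))  # False (escalated, then allowed)
```

The threshold trades latency against accuracy: raising it escalates more calls to the expensive judge, which is how the reported cascades push FPR down while keeping most traffic on the local model.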