Security Engineering of OpenClaw: Analyzing Attack Surface Expansion and Trust-Boundary Violations

Agentic large language model (LLM) systems can now execute actions, not only produce text. When model outputs trigger privileged operations such as shell commands, browser automation, or external tool calls, the security problem shifts from alignment alone to system configuration and structural design. We analyze OpenClaw, a self-hosted multi-agent system in which LLM outputs can execute commands and interact with tools and services. We measure compromise probability, boundary failures, and privilege drift, and how these metrics change as attacker capability increases. With one agent, the compromise probability is 0.24. With seven agents, when the system executes an action as soon as any single agent proposes it, the compromise probability rises to 0.86. The models do not change; the increase comes from output aggregation. Prompt injection propagates instability across the system. Attack surface entropy increases from 0.42 to 0.71, indicating a broader distribution of exploit paths. Mean privilege drift increases from 0.03 to 0.21, indicating unintended authority gain. A positive escalation curvature of 0.08 indicates that privilege grows faster as attacker capability increases. Defensive controls, including policy gating and execution filtering, reduce compromise probability by 0.10, boundary failures by 0.10, and privilege drift by 0.02, all statistically significant at p < 0.0001. The system remains sensitive to attack, but the mitigation impact is measurable. Injection mitigation success differs across models: 0.37 for GPT-5.2, 0.35 for Llama-4-Maverick, and 0.31 for DeepSeek-R1. When execution can be triggered by any single agent, the most vulnerable agent determines system exposure. Mitigations reduce task utility slightly, from 0.93 to 0.89, and increase median latency from 420 ms to 468 ms.
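The jump from 0.24 to 0.86 under any-agent execution can be sanity-checked with a short calculation. The sketch below assumes each agent is compromised independently with the same single-agent probability. The abstract does not state this, so the independence model and the function names `any_agent_compromise` and `relative_change` are illustrative assumptions, not the authors' method. Under that assumption, seven agents give roughly 0.85, close to the reported 0.86. The same sketch expresses the reported mitigation costs as relative changes.

```python
def any_agent_compromise(p_single: float, n_agents: int) -> float:
    """Probability that at least one of n agents proposes a compromising action.

    Assumption (not from the source): agents fail independently, each with
    the same probability p_single, and the system executes an action as soon
    as any single agent proposes it (OR-style aggregation).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie in [0, 1]")
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    # Complement of "every agent stays uncompromised".
    return 1.0 - (1.0 - p_single) ** n_agents


def relative_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after` (0.10 means +10%)."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before


if __name__ == "__main__":
    # Figures taken from the abstract.
    p_one = 0.24
    print(f"7 agents, independent OR model: {any_agent_compromise(p_one, 7):.3f}")
    print(f"latency change: {relative_change(420, 468):+.1%}")
    print(f"utility change: {relative_change(0.93, 0.89):+.1%}")
```

Because the independence model depends only on the single-agent probability and the agent count, it is consistent with the abstract's point that the models themselves do not change. The small gap between this estimate and the measured value may reflect correlated failures or propagation between agents, which this sketch does not model.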
