Tool-augmented AI agents substantially extend the practical capabilities of large language models, but they also introduce security risks that cannot be identified through model-only evaluation. In this paper, we present a systematic security assessment of six representative OpenClaw-series agent frameworks, namely OpenClaw, AutoClaw, QClaw, KimiClaw, MaxClaw, and ArkClaw, under multiple backbone models. To support this study, we construct a benchmark of 205 test cases covering representative attack behaviors across the full agent execution lifecycle, enabling unified evaluation of risk exposure at both the framework and model levels. Our results show that all evaluated agents exhibit substantial security vulnerabilities, and that agentized systems are significantly riskier than their underlying models used in isolation. In particular, reconnaissance and discovery behaviors emerge as the most common weaknesses, while different frameworks expose distinct high-risk profiles, including credential leakage, lateral movement, privilege escalation, and resource development. These findings indicate that the security of modern agent systems is shaped not only by the safety properties of the backbone model, but also by the coupling among model capability, tool use, multi-step planning, and runtime orchestration. We further show that once an agent is granted execution capability and persistent runtime context, weaknesses arising in early stages can be amplified into concrete system-level failures. Overall, our study highlights the need to move beyond prompt-level safeguards toward lifecycle-wide security governance for intelligent agent frameworks.