Autonomous LLM agents increasingly operate in stateful environments where they access tools, files, memory, and external services. While such capabilities enable complex real-world workflows, they also introduce security risks that are difficult to capture with existing evaluations. Current agent security benchmarks often rely on manually curated tasks, provide limited coverage of emerging threats, and focus primarily on final outcomes rather than the execution processes that lead to unsafe behavior. We introduce SeClaw, a framework that combines specification-driven security task synthesis with execution-based security evaluation for Autonomous agents. Spec-driven security task synthesis enables scalable and controllable construction of security tasks from structured risk specifications, while SeClaw docker provides a standardized testbed for evaluating agent behavior under diverse safety-risk scenarios. The benchmark covers risks arising from resources, user tasks, environments, and intrinsic agent behaviors, and supports trajectory-aware assessment of unsafe actions beyond final responses. By bridging systematic task synthesis and reproducible security evaluation, SeClaw provides a practical foundation for measuring, diagnosing, and comparing security failures in autonomous LLM agents. The code is available at https://github.com/seclaw-eval/seclaw-eval.
翻译:自主大语言模型代理日益运行于有状态环境中,可访问工具、文件、内存及外部服务。此类能力虽支持复杂的现实工作流,却也引入了现有评估难以捕获的安全风险。当前代理安全基准测试通常依赖人工策划的任务,对新兴威胁的覆盖范围有限,且主要侧重于最终结果而非导致不安全行为的执行过程。我们提出SeClaw框架,该框架将规约驱动的安全任务合成与基于执行的安全评估相结合,适用于自治代理。规约驱动的安全任务合成能够从结构化风险规约中可扩展且可控地构建安全任务,而SeClaw容器则提供标准化测试平台,用于评估代理在多样化安全风险场景下的行为。该基准测试涵盖资源、用户任务、环境及代理内在行为引发的风险,并支持超出最终响应的轨迹感知不安全行为评估。通过桥接系统性任务合成与可复现安全评估,SeClaw为衡量、诊断及比较自主大语言模型代理的安全故障提供了实用基础。代码见https://github.com/seclaw-eval/seclaw-eval。