Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow coverage of modalities and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite that addresses all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping by up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on documents or images, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but also reliably deployable.
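To make the three reported metrics concrete, the sketch below computes Average Score, Pass@k, and Pass^k from per-trial scores. It reads Pass@k as "at least one of the k trials passes" and Pass^k as "all k trials pass", which is the common interpretation and fits the abstract's contrast between peak capability and consistency. The abstract does not give the exact definitions. The 0.75 pass threshold, the `summarize` function, the task names, and the scores are hypothetical, chosen only for illustration.

```python
from statistics import mean

# Hypothetical cutoff for counting a trial as a pass; not taken from the paper.
PASS_THRESHOLD = 0.75


def summarize(trial_scores, threshold=PASS_THRESHOLD):
    """Aggregate per-task trial scores into the three headline metrics.

    trial_scores maps a task id to a list of k per-trial scores in [0, 1].
    """
    per_task = list(trial_scores.values())
    n_tasks = len(per_task)

    # Average Score: mean over tasks of the mean score across trials.
    average_score = mean(mean(scores) for scores in per_task)

    # Pass@k: a task counts if at least one trial passes (peak capability).
    pass_at_k = sum(any(s >= threshold for s in scores) for scores in per_task) / n_tasks

    # Pass^k: a task counts only if every trial passes (consistency).
    pass_hat_k = sum(all(s >= threshold for s in scores) for scores in per_task) / n_tasks

    return {
        "average_score": average_score,
        "pass_at_k": pass_at_k,
        "pass_hat_k": pass_hat_k,
    }


# Made-up scores for four tasks, three trials each.
scores = {
    "task_a": [0.9, 0.8, 1.0],  # passes every trial
    "task_b": [0.9, 0.5, 0.8],  # passes some trials only
    "task_c": [0.2, 0.5, 0.5],  # never passes
    "task_d": [1.0, 1.0, 1.0],  # passes every trial
}
result = summarize(scores)
print(result)
```

In this toy data, task_b counts toward Pass@3 but not Pass^3, so a gap between the two metrics signals inconsistency rather than a lack of peak capability.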
