WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Yang JingYi, Penghui Yang, Zhixiong Zhang, Xilin Wei, Xinyu Fang, Yubo Ma, Haodong Duan, Jing Shao, Jiaqi Wang, Dahua Lin, Kai Chen, Yuhang Zang

Source: arXiv. GitHub repository: https://github.com/internlm/WildClawBench

Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching harness alone shifts a single model by up to 18 points. These results show that long-horizon, native-runtime agent evaluation remains a far-from-resolved task for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.
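
The hybrid grading described in the abstract (deterministic rule checks, environment-state auditing, and an LLM/VLM judge) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `TaskRun` record, the example rule and audit, the file names, and especially the way the three signals are combined (hard gates followed by a clamped judge score). The abstract does not specify how WildClawBench aggregates these signals, so this should not be read as the benchmark's actual grader.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TaskRun:
    """Hypothetical record of one finished task run (not from the benchmark)."""
    output_text: str        # the agent's final answer
    files: Dict[str, str]   # path -> content found in the container after the run


def rule_check(run: TaskRun) -> bool:
    # Deterministic rule (illustrative): the answer must contain an expected token.
    return "42" in run.output_text


def state_audit(run: TaskRun) -> bool:
    # Environment-state audit (illustrative): the expected side effect exists
    # and a forbidden side effect does not.
    return "report.md" in run.files and "secrets.txt" not in run.files


def hybrid_grade(run: TaskRun, judge: Callable[[TaskRun], float]) -> float:
    """Combine the three signals into one score in [0, 1].

    Assumed aggregation: the rule check and the state audit act as hard gates,
    and the judge supplies the semantic score once both gates pass.
    """
    if not (rule_check(run) and state_audit(run)):
        return 0.0
    # Clamp the judge's score so a misbehaving judge cannot leave [0, 1].
    return max(0.0, min(1.0, judge(run)))
```

In this sketch the judge is passed in as a plain callable, so a stub such as `lambda run: 0.8` can stand in for a real LLM/VLM call when exercising the gating logic.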
