ClawBench: Can AI Agents Complete Everyday Online Tasks?

Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, Huaisong Zhang, Xian Wu, Yi Lu, Minyi Lei, Kai Zou, Huifeng Yin, Ping Nie, Liang Chen, Dongfu Jiang, Wenhu Chen, Kelsey R. Allen

Source: arXiv. Project page: https://claw-bench.com

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
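
The interception idea in the abstract, capturing and blocking only the final submission request, can be illustrated with a minimal sketch. This is not ClawBench's implementation: the `Request` and `SubmissionInterceptor` names, the rule format (an HTTP method plus a URL regex), and the example URLs are all hypothetical, chosen only to show how every request except the final one could pass through while the final payload is kept for grading.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    """A simplified outgoing HTTP request (hypothetical type for this sketch)."""
    method: str
    url: str
    body: str = ""


@dataclass
class SubmissionInterceptor:
    """Blocks requests matching the final-submission rule and records them.

    The rule format (method + URL regex) is an assumption of this sketch.
    """
    method: str
    url_pattern: str
    captured: List[Request] = field(default_factory=list)

    def handle(self, request: Request) -> bool:
        """Return True if the request may proceed, False if it was blocked."""
        is_final = (
            request.method.upper() == self.method.upper()
            and re.search(self.url_pattern, request.url) is not None
        )
        if is_final:
            # Keep the payload so it can be inspected later, but do not send it.
            self.captured.append(request)
            return False
        return True


# Usage: browsing traffic passes through; only the final submission is stopped.
interceptor = SubmissionInterceptor(method="POST", url_pattern=r"/checkout/submit$")
allowed = interceptor.handle(Request("GET", "https://shop.example/cart"))
blocked = interceptor.handle(
    Request("POST", "https://shop.example/checkout/submit", "item=1")
)
```

In this sketch the agent still interacts with the live site for every step before submission, and the captured request body is what an evaluator would compare against the task's expected form contents.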

