AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks demand capabilities beyond those tested by existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and performing write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves a completion rate of only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
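To make the idea of the interception layer concrete, the following is a minimal sketch of a rule that lets every request through except the one identified as the final submission, which it records and blocks. It is an illustration under stated assumptions, not the ClawBench implementation: the names `Request`, `SubmissionInterceptor`, and `handle`, the rule format (HTTP method plus URL pattern), and the example URLs are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Request:
    """A simplified outgoing HTTP request (hypothetical structure)."""
    method: str
    url: str
    body: Optional[str] = None


@dataclass
class SubmissionInterceptor:
    """Blocks only requests matching an assumed per-task final-submission rule."""
    method: str        # assumed: HTTP method of the final submission
    url_pattern: str   # assumed: regex identifying the final submission URL
    captured: List[Request] = field(default_factory=list)

    def handle(self, request: Request) -> str:
        is_final = (
            request.method.upper() == self.method.upper()
            and re.search(self.url_pattern, request.url) is not None
        )
        if is_final:
            # Keep the payload for later evaluation, but never send it.
            self.captured.append(request)
            return "block"
        # All other traffic reaches the live site unchanged.
        return "allow"


# Usage with made-up URLs: only the last request is blocked.
interceptor = SubmissionInterceptor(method="POST", url_pattern=r"/checkout/submit$")
decisions = [
    interceptor.handle(r)
    for r in [
        Request("GET", "https://shop.example/cart"),
        Request("POST", "https://shop.example/cart/add", body="sku=123"),
        Request("POST", "https://shop.example/checkout/submit", body="order=1"),
    ]
]
```

In this sketch, intermediate write operations such as adding an item to a cart still reach the site; only the request that would create a real-world side effect is withheld, and its captured body is what an evaluator could inspect.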