LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, which makes it difficult to evaluate agents against evolving workflow demand or to verify that a task was actually executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is materialized as controlled tasks with fixed fixtures, services, workspaces, and graders; the current release draws its signals from ClawHub Top-500 skills. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts. It applies deterministic checks when the evidence is sufficient and reserves structured LLM judging for semantic dimensions. The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments show that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks, and no model reaches 70%. Failures are structured by task family and execution surface: HR, management, and multi-system business workflows are persistent bottlenecks, while local workspace repair is comparatively easier but still unsaturated. Leaderboard rank alone is insufficient, because models with similar pass rates can diverge in overall completion and task-level discrimination concentrates in a middle band of tasks. Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice: in fresh external demand and in verifiable agent action.
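The grading policy described in the abstract, deterministic checks over recorded evidence with LLM judging reserved for semantic dimensions, can be illustrated with a minimal Python sketch. It is not the benchmark's actual implementation: the names `Evidence`, `Dimension`, `grade`, and `ticket_closed`, the `ticket_42` key, and the choice to fail a non-semantic dimension whose evidence is missing are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Evidence:
    """Hypothetical container for what a run leaves behind."""
    trace: list[str] = field(default_factory=list)            # ordered agent actions
    audit_log: list[str] = field(default_factory=list)        # service-side audit entries
    service_state: dict[str, object] = field(default_factory=dict)
    workspace_files: dict[str, str] = field(default_factory=dict)  # path -> content


@dataclass
class Dimension:
    """One graded aspect of a task (hypothetical structure)."""
    name: str
    # Returns True/False when the evidence is sufficient, None when it is not.
    deterministic_check: Optional[Callable[[Evidence], Optional[bool]]] = None
    # Only semantic dimensions may be sent to the LLM judge.
    semantic: bool = False


def grade(
    dimensions: list[Dimension],
    evidence: Evidence,
    judge: Callable[[str, Evidence], bool],
) -> dict[str, bool]:
    """Grade each dimension, preferring deterministic checks over the judge."""
    results: dict[str, bool] = {}
    for dim in dimensions:
        verdict: Optional[bool] = None
        if dim.deterministic_check is not None:
            verdict = dim.deterministic_check(evidence)
        if verdict is None and dim.semantic:
            verdict = judge(dim.name, evidence)
        # Assumption: no usable evidence and no judge means the dimension fails.
        results[dim.name] = bool(verdict)
    return results


def ticket_closed(ev: Evidence) -> Optional[bool]:
    """Example deterministic check against post-run service state."""
    state = ev.service_state.get("ticket_42")  # hypothetical fixture key
    if state is None:
        return None  # evidence insufficient
    return state == "closed"
```

In this sketch the judge is passed in as a plain callable, so a stub can stand in for a real LLM call during testing, and the judge is consulted only for dimensions explicitly marked `semantic`.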