Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluating 4 model families and 8 agent harness frameworks, we find that (i) harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline; (ii) completion remains the primary axis of variation, with no model saturating the benchmark; and (iii) automated generation enables evaluation at a previously infeasible scale. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.
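To make the three-module structure described in the abstract concrete, the following is a minimal illustrative sketch of a parse, generate, validate flow. It is not the ClawEnvKit implementation: every name here (`GenerationParams`, `Environment`, `parse_request`, `generate`, `validate`, `build_environments`) is hypothetical, the regex-based parser and template-based generator are toy stand-ins for whatever the actual system uses, and the validator models only structural validity, internal consistency, and a crude duplicate-based diversity check. Feasibility checking is not modeled.

```python
import re
from dataclasses import dataclass


@dataclass
class GenerationParams:
    """Structured parameters extracted from a request (hypothetical schema)."""
    category: str
    count: int


@dataclass
class Environment:
    """One generated environment (hypothetical schema)."""
    task_spec: str
    tool_interface: list[str]
    scoring_config: dict[str, float]  # tool name -> weight in the final score


def parse_request(text: str) -> GenerationParams:
    """Toy parser: expects phrasing like 'generate 3 email environments'."""
    match = re.search(r"(\d+)\s+(\w+)\s+environments?", text)
    if match is None:
        raise ValueError(f"cannot parse request: {text!r}")
    return GenerationParams(category=match.group(2), count=int(match.group(1)))


def generate(params: GenerationParams) -> list[Environment]:
    """Toy generator: fills a fixed template instead of calling a model."""
    tools = [f"{params.category}_read", f"{params.category}_write"]
    return [
        Environment(
            task_spec=f"{params.category} task #{i}",
            tool_interface=list(tools),
            scoring_config={tool: 1.0 / len(tools) for tool in tools},
        )
        for i in range(params.count)
    ]


def validate(envs: list[Environment]) -> list[Environment]:
    """Keep environments that pass the simplified checks described above."""
    seen_specs: set[str] = set()
    kept = []
    for env in envs:
        # Structural validity: all three components must be present.
        structurally_valid = bool(
            env.task_spec and env.tool_interface and env.scoring_config
        )
        # Internal consistency: scoring refers only to declared tools,
        # and the weights sum to 1.
        consistent = (
            set(env.scoring_config) <= set(env.tool_interface)
            and abs(sum(env.scoring_config.values()) - 1.0) < 1e-9
        )
        # Diversity (crude): reject exact duplicate task specifications.
        is_new = env.task_spec not in seen_specs
        if structurally_valid and consistent and is_new:
            seen_specs.add(env.task_spec)
            kept.append(env)
    return kept


def build_environments(request: str) -> list[Environment]:
    """Chain the three modules: parser -> generator -> validator."""
    return validate(generate(parse_request(request)))
```

Under these assumptions, `build_environments("generate 3 email environments")` returns three environments that share a two-tool interface, and passing the same environment to `validate` twice keeps only one copy.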