Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task sandboxes. To support reliable evaluation, we also construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources will soon be released at https://github.com/ClawGym.