SynthTools: A Framework for Scaling Synthetic Tools for Agent Development

Agentic systems that use external tools to solve complex, long-horizon tasks require a large set of diverse and controllable tool-use environments. We introduce SynthTools, a fully LLM-based pipeline spanning the entire lifecycle: environment generation, simulation, validation, and task construction. By operating end-to-end through LLMs, our framework complements tool-use environments that are bottlenecked by the complexity of real APIs, and it provides scalability and controllability by design. The framework consists of three components: top-down environment generation, which hierarchically constructs diverse, domain-grounded tool environments; environment simulation and validation, which checks that tools can be reliably emulated and filters out those that cannot; and bottom-up task and trajectory generation, which produces solvable and verifiable tasks together with multi-step trajectories and, for flexibility, exposes control over difficulty, length, trajectory composition, and domain focus. As a concrete instantiation, we release a dataset comprising $73{,}883$ validated tools across $6{,}800$ environments and $100$ fields, along with $79{,}925$ verifiable tasks and the pipeline to generate trajectories at scale. Training Qwen3 models of various sizes on a corpus of trajectories generated from these tasks yields gains across multiple tool-use benchmarks, including benchmarks built on real APIs, indicating that tool-use capabilities trained on synthetic data may transfer to some real environments. Together, these results suggest that SynthTools can serve as useful infrastructure for large-scale training of tool-use agents.
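To make the three-stage structure concrete, the following is a minimal sketch of how such a pipeline could be organized. It is not the SynthTools implementation: the names `Tool`, `Task`, `stub_llm`, `generate_environments`, `validate`, and `build_tasks` are hypothetical, the LLM is replaced by a deterministic stub so the code runs without a model, and the validation predicate is an assumed placeholder for a real simulation check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: maps a prompt to a list of names.
LLM = Callable[[str], List[str]]


@dataclass
class Tool:
    name: str
    environment: str
    field_name: str


@dataclass
class Task:
    description: str
    steps: List[Tool]  # ordered tools forming a reference trajectory


def stub_llm(prompt: str) -> List[str]:
    """Deterministic placeholder so the sketch runs without a real model."""
    return [f"{prompt}/{i}" for i in range(2)]


def generate_environments(fields: List[str], llm: LLM) -> List[Tool]:
    """Top-down generation: field -> environments -> tools."""
    tools: List[Tool] = []
    for field_name in fields:
        for env in llm(field_name):
            for tool_name in llm(env):
                tools.append(Tool(tool_name, env, field_name))
    return tools


def validate(tools: List[Tool], can_simulate: Callable[[Tool], bool]) -> List[Tool]:
    """Keep only the tools that the (assumed) simulator can emulate reliably."""
    return [t for t in tools if can_simulate(t)]


def build_tasks(tools: List[Tool], length: int) -> List[Task]:
    """Bottom-up generation: compose validated tools from one environment
    into a task whose trajectory has the requested number of steps."""
    by_env: Dict[str, List[Tool]] = {}
    for t in tools:
        by_env.setdefault(t.environment, []).append(t)
    tasks: List[Task] = []
    for env, env_tools in by_env.items():
        if len(env_tools) >= length:
            steps = env_tools[:length]
            tasks.append(Task(f"Complete a {length}-step task in {env}", steps))
    return tasks


# Example run with two illustrative fields and one environment that
# fails the assumed validation check.
all_tools = generate_environments(["finance", "health"], stub_llm)
valid_tools = validate(all_tools, lambda t: t.environment != "health/1")
tasks = build_tasks(valid_tools, length=2)
```

In this sketch the `length` argument stands in for the kind of control over trajectory length described in the abstract; difficulty, composition, and domain focus would need additional parameters that are not modeled here.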
