The advancement of LLM agents with tool-use capabilities requires diverse and complex training corpora. Existing data generation methods, which predominantly follow a paradigm of random sampling and shallow generation, often yield simple, homogeneous trajectories that fail to capture complex, implicit logical dependencies. To bridge this gap, we introduce HardGen, an automatic agentic pipeline designed to generate hard tool-use training samples with verifiable reasoning. First, HardGen establishes a dynamic API Graph built upon agent failure cases, from which it samples to synthesize hard traces. Second, these traces serve as conditional priors to guide the instantiation of modular, abstract advanced tools, which are subsequently leveraged to formulate hard queries. Finally, the advanced tools and hard queries enable the generation of verifiable complex Chain-of-Thought (CoT), with closed-loop evaluation feedback steering continuous refinement of the process. Extensive evaluations demonstrate that a 4B-parameter model trained on our curated dataset outperforms several leading open-source and closed-source competitors (e.g., GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5). Our code, models, and dataset will be open-sourced to facilitate future research.
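To make the failure-driven sampling step concrete, the sketch below illustrates one plausible reading of the dynamic API Graph: APIs are nodes, observed call transitions are edges, and edge weights grow with agent failure counts so that a biased random walk tends to produce hard traces. All class and function names here are hypothetical illustrations, not the paper's actual implementation.

```python
# Minimal sketch (hypothetical names, assumed behavior) of failure-driven
# API-graph construction and hard-trace sampling as described in the abstract.
import random
from collections import defaultdict

class APIGraph:
    def __init__(self):
        # adjacency: src_api -> {dst_api: weight}; weight tracks failure frequency
        self.edges = defaultdict(lambda: defaultdict(float))

    def record_failure(self, trace):
        """Reinforce every API-to-API transition seen in a failed trajectory."""
        for src, dst in zip(trace, trace[1:]):
            self.edges[src][dst] += 1.0

    def sample_hard_trace(self, start, length=4):
        """Random walk over the graph, biased toward high-failure transitions."""
        trace = [start]
        node = start
        for _ in range(length - 1):
            neighbors = self.edges[node]
            if not neighbors:
                break
            apis, weights = zip(*neighbors.items())
            node = random.choices(apis, weights=weights, k=1)[0]
            trace.append(node)
        return trace

# Toy usage: repeated failures on search -> book -> pay make that path
# more likely to be sampled as a hard trace.
g = APIGraph()
g.record_failure(["search_flights", "book_flight", "pay_invoice"])
g.record_failure(["search_flights", "book_flight", "pay_invoice"])
g.record_failure(["search_flights", "check_visa"])
print(g.sample_hard_trace("search_flights"))
```

In this reading, the sampled trace would then serve as the conditional prior for instantiating advanced tools and formulating the corresponding hard query, as described above.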