Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to solve complex tasks by interacting with external tools, yet existing approaches rely on synthesized trajectories selected by scoring functions and on sparse outcome-based rewards, which provide limited and biased supervision for learning TIR. To address these challenges, in this paper we propose AutoTraj, a two-stage framework that automatically learns TIR by repairing and rewarding tool-use trajectories. Specifically, in the supervised fine-tuning (SFT) stage, AutoTraj generates multiple candidate tool-use trajectories for each query and evaluates them along multiple dimensions. High-quality trajectories are retained directly, while low-quality ones are repaired by an LLM (i.e., LLM-as-Repairer). The repaired and high-quality trajectories together form a synthetic SFT dataset, while each repaired trajectory, paired with its original low-quality counterpart, constitutes a dataset for trajectory preference modeling. In the reinforcement learning (RL) stage, we train a trajectory-level reward model on the preference dataset to assess the quality of reasoning paths and combine it with outcome and format rewards, thereby explicitly guiding optimization toward reliable TIR behaviors. Experiments on real-world benchmarks demonstrate the effectiveness of AutoTraj for TIR.
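To make the two stages concrete, the sketch below illustrates how the SFT-stage data could be assembled and how the RL-stage rewards could be combined. It is a minimal illustration of the pipeline described above, not the paper's implementation: the helper functions (`sample`, `evaluate`, `repair`), the quality threshold, and the weighted-sum reward form and weights are all assumptions, since the abstract does not specify them.

```python
# Minimal sketch of AutoTraj's two stages as described in the abstract.
# Function names, the quality threshold, and the reward weights are
# illustrative assumptions, not details given by the paper.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    query: str
    steps: List[str]      # interleaved reasoning and tool-call steps
    score: float = 0.0    # aggregate multi-dimensional quality score


def build_sft_and_preference_data(
    queries: List[str],
    sample: Callable[[str], List[Trajectory]],    # candidate generation (hypothetical helper)
    evaluate: Callable[[Trajectory], float],      # multi-dimensional scoring (hypothetical helper)
    repair: Callable[[Trajectory], Trajectory],   # LLM-as-Repairer (hypothetical helper)
    threshold: float = 0.8,                       # assumed quality cutoff
) -> Tuple[List[Trajectory], List[Tuple[Trajectory, Trajectory]]]:
    """SFT stage: keep high-quality trajectories, repair low-quality ones,
    and pair each repaired trajectory with its original as a preference example."""
    sft_data: List[Trajectory] = []
    preference_pairs: List[Tuple[Trajectory, Trajectory]] = []  # (chosen=repaired, rejected=original)
    for query in queries:
        for traj in sample(query):
            traj.score = evaluate(traj)
            if traj.score >= threshold:
                sft_data.append(traj)                 # high-quality: retained directly
            else:
                repaired = repair(traj)               # low-quality: repaired by an LLM
                sft_data.append(repaired)
                preference_pairs.append((repaired, traj))
    return sft_data, preference_pairs


def total_reward(outcome_r: float, format_r: float, traj_r: float,
                 w_outcome: float = 1.0, w_format: float = 0.2, w_traj: float = 0.5) -> float:
    """RL stage: combine outcome and format rewards with the learned
    trajectory-level reward. The weighted-sum form and weights are assumed."""
    return w_outcome * outcome_r + w_format * format_r + w_traj * traj_r
```

Under these assumptions, the preference pairs returned by `build_sft_and_preference_data` would train the trajectory-level reward model, whose score then enters `total_reward` alongside the outcome and format terms during RL optimization.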