THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning

Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but they still struggle with high-precision tasks such as numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising way to bridge this gap. Despite recent advances, existing methods face three key challenges: constructing tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool-Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi-agent pipeline for constructing high-quality datasets of tool-integrated reasoning paths that align with the policy and generalize well across diverse models. Second, to perform fine-grained hierarchical optimization, we introduce an RL strategy that jointly optimizes episode-level problem solving and step-level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer's correctness. Finally, THOR incorporates a self-correction mechanism that uses immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach generalizes well across diverse models, performing effectively on both reasoning and non-reasoning models. It further achieves state-of-the-art performance among models of a similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks. Our code will be publicly available at https://github.com/JingMog/THOR.
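As a rough illustration of the step-level signal and the self-correction loop described above, the following minimal Python sketch may help. It is not the released THOR implementation: the function names (`run_tool`, `step_reward`, `episode_reward`, `solve_with_self_correction`, `toy_policy`), the reward values, and the retry budget are all hypothetical choices made for this example, and the "policy" is a hard-coded stub rather than an LLM.

```python
import contextlib
import io
from typing import Callable, List, Tuple


def run_tool(code: str) -> Tuple[bool, str]:
    """Execute a code snippet; return (succeeded, captured stdout or error text)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # illustration only; a real system would sandbox this
        return True, buffer.getvalue().strip()
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"


def step_reward(succeeded: bool) -> float:
    """Assumed step-level signal: +1 if the tool call executed, -1 otherwise."""
    return 1.0 if succeeded else -1.0


def episode_reward(final_answer: str, reference: str) -> float:
    """Assumed episode-level signal: 1 for a correct final answer, else 0."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def solve_with_self_correction(
    propose_code: Callable[[str], str], max_retries: int = 2
) -> Tuple[str, List[float]]:
    """Ask the policy for code, run it, and retry with the error as feedback."""
    feedback = ""
    rewards: List[float] = []
    for _ in range(max_retries + 1):
        code = propose_code(feedback)
        succeeded, output = run_tool(code)
        rewards.append(step_reward(succeeded))
        if succeeded:
            return output, rewards
        feedback = output  # immediate tool feedback drives the revision
    return "", rewards


def toy_policy(feedback: str) -> str:
    """Hypothetical stand-in for an LLM: fails first, then fixes its code."""
    if not feedback:
        return "print(1 / 0)"  # erroneous first attempt
    return "print(2 ** 10)"  # revised attempt after seeing the error


answer, rewards = solve_with_self_correction(toy_policy)
print(answer, rewards, episode_reward(answer, "1024"))
```

In this toy run the first tool call raises an error and receives a negative step-level signal, the revised call succeeds, and the final answer is then scored once at the episode level. How THOR actually combines the two levels during training is not specified in this abstract.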
