Can Tool-Integrated Reinforcement Learning Generalize Across Diverse Domains?

Recent advances in large language models (LLMs) have demonstrated remarkable capabilities in reasoning and tool use. However, whether tool-augmented reinforcement learning (RL) generalizes across diverse domains remains underexplored. In this work, we investigate the cross-domain generalization of an LLM agent equipped with a code interpreter tool and trained exclusively on mathematical problem-solving tasks. Although training is restricted to this single domain, we evaluate the agent on several distinct reasoning domains. The results reveal that RL-based tool usage learned from mathematical tasks transfers effectively to complex tasks in other domains, yielding strong task performance and high token efficiency. To facilitate this cross-domain transfer, we propose a Tool Generalization Reinforcement Learning (TGRL) framework designed to promote domain-agnostic learning and skill migration. It comprises: (i) a standardized tool interface that abstracts away domain-specific nuances through consistent formatting and explicit termination, fostering transferable invocation patterns; (ii) a dual-component reward system that decomposes the reward to incentivize generalizable behaviors such as tool efficiency and reasoning abstraction, supporting alignment and robustness under domain shift; and (iii) an XML-based prompt template that separates thinking, tool calls, and responses to encourage modular, domain-invariant planning and coherent multi-turn interactions. Extensive experiments across diverse benchmarks validate our approach, achieving state-of-the-art performance and highlighting the cross-domain potential of Tool RL for LLM reasoning.
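As a rough illustration of how components (ii) and (iii) might fit together, the sketch below parses one model turn written in an XML-style template and scores it with a two-part reward. It is not the TGRL implementation: the tag names (`think`, `tool_call`, `answer`), the exact-match outcome check, the weights `w_outcome` and `w_tool`, and the `max_calls` budget are all assumptions introduced here, since the abstract specifies none of them.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical tag names: the abstract only states that the template
# separates thinking, tool calls, and responses.
TAG_PATTERN = re.compile(r"<(think|tool_call|answer)>(.*?)</\1>", re.DOTALL)


@dataclass
class ParsedTurn:
    thoughts: List[str]
    tool_calls: List[str]
    answer: Optional[str]  # None when the turn has no explicit termination


def parse_turn(text: str) -> ParsedTurn:
    """Split one model turn into thoughts, tool calls, and a final answer."""
    thoughts: List[str] = []
    tool_calls: List[str] = []
    answer: Optional[str] = None
    for tag, body in TAG_PATTERN.findall(text):
        body = body.strip()
        if tag == "think":
            thoughts.append(body)
        elif tag == "tool_call":
            tool_calls.append(body)
        else:
            answer = body  # the last <answer> block wins
    return ParsedTurn(thoughts, tool_calls, answer)


def dual_reward(
    parsed: ParsedTurn,
    gold: str,
    w_outcome: float = 1.0,  # assumed weight, not taken from the paper
    w_tool: float = 0.2,     # assumed weight, not taken from the paper
    max_calls: int = 4,      # assumed tool-call budget
) -> float:
    """Two-part reward: answer correctness plus a tool-efficiency bonus."""
    if parsed.answer is None:
        # No explicit termination: neither component pays out.
        return 0.0
    outcome = 1.0 if parsed.answer == gold else 0.0
    # Fewer tool calls earn a larger bonus, floored at zero.
    efficiency = max(0.0, 1.0 - len(parsed.tool_calls) / max_calls)
    return w_outcome * outcome + w_tool * efficiency


turn = (
    "<think>Multiply the two numbers with the interpreter.</think>"
    "<tool_call>print(6 * 7)</tool_call>"
    "<answer>42</answer>"
)
parsed = parse_turn(turn)
reward = dual_reward(parsed, gold="42")
```

In this sketch the two components are kept separate until the final weighted sum, so either one can be logged or reweighted independently. A real reward of this kind would likely use a more tolerant answer check than exact string equality.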
