Practical guidance on training Large Language Models (LLMs) to use a Code Interpreter across diverse tasks remains lacking. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. Unlike prior RL + tool-use efforts focused on narrow domains such as math or retrieval, we curate 144 diverse reasoning and planning tasks and show that training a general-purpose Code Interpreter model across them presents significant challenges due to task heterogeneity and the scarcity of effective samples. To address this, we introduce a multi-stage curriculum learning approach that partitions training samples by measured improvement potential. RL training prioritizes samples with higher potential and gradually shifts to lower-potential ones, increasing the average RL gain from +3.4% to +9.3% across Qwen-2.5 models (3/7/14B). Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.1% to 72.4%, outperforming text-only GPT-4o (58.6%) and GPT-4o with Code Interpreter (70.9%). Notably, R1-CI-14B also exhibits emergent self-checking behavior through code generation. Datasets, code, and models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.
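The multi-stage curriculum described in the abstract can be illustrated with a minimal sketch. The abstract does not define how improvement potential is measured, so the score below is an assumption for illustration only: it uses `p * (1 - p)`, where `p` is the fraction of sampled rollouts that solved the sample, so that samples the model sometimes solves rank highest. The names `Sample`, `improvement_potential`, and `build_curriculum` are hypothetical and do not come from the released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    task: str
    # Assumed input: fraction of sampled rollouts that were correct, in [0, 1].
    success_rate: float


def improvement_potential(sample: Sample) -> float:
    """Hypothetical potential score: highest when the model succeeds about
    half the time, zero when it always or never succeeds."""
    p = sample.success_rate
    return p * (1.0 - p)


def build_curriculum(samples: List[Sample], num_stages: int) -> List[List[Sample]]:
    """Partition samples into at most `num_stages` stages, ordered from
    highest to lowest potential, so RL can train on stage 0 first."""
    if num_stages <= 0:
        raise ValueError("num_stages must be positive")
    if not samples:
        return []
    ranked = sorted(samples, key=improvement_potential, reverse=True)
    stage_size = -(-len(ranked) // num_stages)  # ceiling division
    return [ranked[i:i + stage_size] for i in range(0, len(ranked), stage_size)]


if __name__ == "__main__":
    pool = [
        Sample("task_a", 0.5),
        Sample("task_b", 0.0),
        Sample("task_c", 0.9),
        Sample("task_d", 0.6),
    ]
    for stage_index, stage in enumerate(build_curriculum(pool, num_stages=2)):
        print(stage_index, [s.task for s in stage])
```

In this sketch an RL loop would train on the first stage and then move to later stages; the paper's actual potential measure and stage schedule may differ.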