Automatic code generation has been a longstanding research topic. With the advancement of general-purpose large language models (LLMs), coding ability has become an important measure of a model's reasoning performance. A Code LLM is usually obtained through a two-stage training paradigm: pretraining followed by fine-tuning. Within fine-tuning, supervised fine-tuning (SFT) and reinforcement learning (RL) are commonly used to improve the model's zero-shot ability. A large body of work has improved model performance on code-related benchmarks, either by modifying the algorithms or by refining the datasets. However, we still lack a deep understanding of the interplay between SFT and RL: for instance, what kind of dataset ensures generalization, or what happens if the SFT phase is dropped from fine-tuning altogether. In this work, we attempt to understand the correlation between SFT and RL. To facilitate our research, we manually craft 100 basic Python functions, called atomic functions, and then deploy a synthesizing pipeline to create a large number of synthetic functions on top of the atomic ones. In this manner, the train and test sets remain distinct, preventing data contamination. Through a comprehensive ablation study, we find: (1) both atomic and synthetic functions are indispensable for SFT's generalization, and only a handful of synthetic functions are adequate; (2) RL greatly enhances SFT's generalization to the target domain, even with the same training prompts; (3) training RL from scratch alleviates the over-fitting introduced in the SFT phase.
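To make the data-construction idea concrete, the sketch below shows one way a synthesizing pipeline could compose hand-crafted atomic functions into synthetic ones. The two atomic functions (`reverse_string`, `count_vowels`) and the `synthesize` helper are hypothetical illustrations; the abstract does not specify the actual 100 functions or the composition procedure.

```python
import random

# Hypothetical "atomic" functions standing in for the hand-crafted basis set.
def reverse_string(s: str) -> str:
    """Return the input string reversed."""
    return s[::-1]

def count_vowels(s: str) -> int:
    """Count the vowels in the input string."""
    return sum(1 for ch in s.lower() if ch in "aeiou")

ATOMIC_FUNCTIONS = [reverse_string, count_vowels]

def synthesize(seed: int, depth: int = 2):
    """Compose randomly chosen atomic functions into one synthetic function.

    A minimal sketch of a synthesis pipeline, assuming composition by
    chaining; the paper's actual procedure may differ.
    """
    rng = random.Random(seed)
    chain = [rng.choice(ATOMIC_FUNCTIONS) for _ in range(depth)]

    def synthetic(s: str):
        out = s
        for fn in chain:
            # Coerce intermediate results to str so any chain type-checks.
            out = fn(str(out))
        return out

    # Record the composition so each synthetic function is self-describing.
    synthetic.__doc__ = " -> ".join(fn.__name__ for fn in chain)
    return synthetic

fn = synthesize(seed=0)
print(fn.__doc__)   # which atomic functions were composed
print(fn("hello"))  # result of applying the synthetic function
```

Because the synthetic functions are generated programmatically from a fixed atomic basis, held-out compositions can serve as a test set that is guaranteed not to overlap with training data.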