Unlock the Correlation between Supervised Fine-Tuning and Reinforcement Learning in Training Code Large Language Models

Automatic code generation has been a longstanding research topic. With the advancement of general-purpose large language models (LLMs), coding ability has become an important measure of a model's reasoning performance. A Code LLM is usually obtained through a two-stage training paradigm: pretraining followed by fine-tuning. Within fine-tuning, supervised fine-tuning (SFT) and reinforcement learning (RL) are often used to improve the model's zero-shot ability. A large body of work has sought to improve performance on code-related benchmarks, either by modifying the algorithm or by refining the dataset. However, we still lack deep insight into the correlation between SFT and RL. For instance, it is unclear what kind of dataset should be used to ensure generalization, or what happens if the SFT phase is abandoned altogether. In this work, we attempt to understand the correlation between SFT and RL. To facilitate our research, we manually craft 100 basic Python functions, called atomic functions, and then deploy a synthesizing pipeline to create a large number of synthetic functions on top of the atomic ones. In this manner, we ensure that the train and test sets remain distinct, preventing data contamination. Through a comprehensive ablation study, we find: (1) both atomic and synthetic functions are indispensable for SFT's generalization, and only a handful of synthetic functions are adequate; (2) through RL, SFT's generalization to the target domain can be greatly enhanced, even with the same training prompts; (3) training RL from scratch can alleviate the over-fitting issue introduced in the SFT phase.
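The abstract does not spell out how synthetic functions are built from atomic ones. As a rough illustration only, the sketch below assumes the simplest possible scheme: a synthetic function is the composition of two atomic functions, and the train and test sets are disjoint subsets of those compositions. The three atomic functions and the helper names `compose`, `build_synthetic`, and `split_disjoint` are hypothetical and are not taken from the paper.

```python
import itertools
import random

# Hypothetical atomic functions; the paper's 100 functions are not listed in the abstract.
def reverse_string(s: str) -> str:
    return s[::-1]

def to_upper(s: str) -> str:
    return s.upper()

def drop_vowels(s: str) -> str:
    return "".join(ch for ch in s if ch.lower() not in "aeiou")

ATOMIC = [reverse_string, to_upper, drop_vowels]

def compose(f, g):
    """Return a synthetic function that applies f first, then g."""
    def synthetic(s: str) -> str:
        return g(f(s))
    synthetic.__name__ = f"{f.__name__}_then_{g.__name__}"
    return synthetic

def build_synthetic(atomic):
    """Assumed pipeline: one synthetic function per ordered pair of distinct atomic functions."""
    return [compose(f, g) for f, g in itertools.permutations(atomic, 2)]

def split_disjoint(functions, test_fraction=0.5, seed=0):
    """Shuffle with a fixed seed and cut once, so no function lands in both sets."""
    rng = random.Random(seed)
    shuffled = functions[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

synthetic = build_synthetic(ATOMIC)
train, test = split_disjoint(synthetic)
```

Because each synthetic function is assigned to exactly one side of the split, a test-set function never appears in training, which is the contamination-free property the abstract describes. The real pipeline may combine atomic functions in richer ways than pairwise composition.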

