Automatic code generation is a longstanding research topic. With the advancement of general-purpose large language models (LLMs), coding ability has become an important measure of a model's reasoning performance. A Code LLM is usually obtained through a two-stage training paradigm: pretraining followed by fine-tuning. Within fine-tuning, supervised fine-tuning (SFT) and reinforcement learning (RL) are often used to improve the model's zero-shot ability. A large body of work has sought to improve performance on code-related benchmarks, either by modifying the algorithm or by refining the dataset. However, we still lack deep insight into the correlation between SFT and RL: for instance, what kind of dataset should be used to ensure generalization, or what happens if the SFT phase is skipped during fine-tuning. In this work, we attempt to understand this correlation. To facilitate our research, we manually craft 100 basic Python functions, called atomic functions, and then deploy a synthesizing pipeline to create a large number of synthetic functions on top of the atomic ones. In this manner, we ensure that the train and test sets remain distinct, preventing data contamination. Through a comprehensive ablation study, we find: (1) both atomic and synthetic functions are indispensable for SFT's generalization, and only a handful of synthetic functions are adequate; (2) through RL, SFT's generalization to the target domain can be greatly enhanced, even with the same training prompts; (3) training RL from scratch can alleviate the over-fitting issue introduced in the SFT phase.
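The abstract does not say how the synthesizing pipeline combines atomic functions, so the following is only a minimal sketch under one assumption: that a synthetic function is a sequential chain of atomic functions. The three atomic functions, the `synthesize` helper, and the `build_split` helper are all hypothetical names introduced for illustration. The sketch shows how splitting at the level of compositions could keep the train and test sets disjoint.

```python
import itertools
import random

# Hypothetical stand-ins for the hand-written atomic functions
# (the paper uses 100 of them; these three are invented for illustration).
def reverse_string(s: str) -> str:
    return s[::-1]

def to_upper(s: str) -> str:
    return s.upper()

def drop_vowels(s: str) -> str:
    return "".join(ch for ch in s if ch.lower() not in "aeiou")

ATOMIC = {
    "reverse_string": reverse_string,
    "to_upper": to_upper,
    "drop_vowels": drop_vowels,
}

def synthesize(names):
    """Chain the named atomic functions, left to right, into one synthetic function.

    Assumption: composition by chaining; the actual pipeline may differ.
    """
    def composed(s: str) -> str:
        for name in names:
            s = ATOMIC[name](s)
        return s
    composed.__name__ = "_then_".join(names)
    return composed

def build_split(depth=2, test_fraction=0.5, seed=0):
    """Enumerate ordered compositions and split them into disjoint train/test sets."""
    combos = list(itertools.permutations(sorted(ATOMIC), depth))
    random.Random(seed).shuffle(combos)
    cut = int(len(combos) * (1 - test_fraction))
    return combos[:cut], combos[cut:]

train_combos, test_combos = build_split()
example = synthesize(("reverse_string", "to_upper"))
print(example.__name__, example("hello"))
```

Because each composition is assigned to exactly one side of the split, no synthetic function seen in training reappears at test time, even though both sides are built from the same atomic functions.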