Large code datasets have become increasingly accessible for pre-training source code models. However, for the fine-tuning phase, obtaining representative training data that fully covers the code distribution of a specific downstream task remains challenging, because such data is task-specific and labeling resources are limited. Moreover, fine-tuning a pretrained model can cause it to forget knowledge acquired during pre-training. Together, these factors lead to out-of-distribution (OOD) generalization issues and unexpected model inference behaviors that have not yet been systematically studied. In this paper, we contribute the first systematic approach that simulates various OOD scenarios along different dimensions of source code data properties and studies the behavior of fine-tuned models in such scenarios. We investigate model behavior under different fine-tuning methodologies, including full fine-tuning and Low-Rank Adaptation (LoRA) fine-tuning. Our comprehensive analysis, conducted on four state-of-the-art pretrained models and applied to two code generation tasks, exposes multiple failure modes attributable to OOD generalization issues. Additionally, our analysis shows that LoRA fine-tuning consistently exhibits significantly better OOD generalization performance than full fine-tuning across various scenarios.
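As a rough illustration of what simulating an OOD scenario along one data-property dimension could look like, the sketch below holds out every sample whose property value falls in a chosen range, so that the range is unseen during fine-tuning and can be used only for evaluation. The property (a whitespace token count), the thresholds, and all function names are hypothetical choices made for this example and are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def split_by_property(
    samples: Sequence[str],
    prop: Callable[[str], float],
    low: float,
    high: float,
) -> Tuple[List[str], List[str]]:
    """Split samples into (in_distribution, held_out).

    A sample is held out when low <= prop(sample) < high, which
    masks that region of the property range from the training set.
    """
    in_dist: List[str] = []
    held_out: List[str] = []
    for sample in samples:
        if low <= prop(sample) < high:
            held_out.append(sample)
        else:
            in_dist.append(sample)
    return in_dist, held_out


def token_count(code: str) -> int:
    """Hypothetical property: number of whitespace-separated tokens."""
    return len(code.split())


# Toy corpus; a real study would use a code generation dataset.
snippets = [
    "x = 1",
    "def f(a, b):\n    return a + b",
    "print('hi')",
    "for i in range(10):\n    total += i\n    print(total)",
]

# Hold out the "long" programs (6 or more tokens) as the OOD region.
train, held_out = split_by_property(snippets, token_count, low=6, high=100)
```

Fine-tuning on `train` and evaluating on `held_out` would then probe how a model behaves on a region of the property range it never saw during fine-tuning.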