Current language models tailored for code tasks often adopt the pre-training-then-fine-tuning paradigm from natural language processing, modeling source code as plain text. This approach, however, overlooks the unambiguous structures inherent in programming languages. In this work, we explore data-efficient adaptation of pre-trained code models by further pre-training and fine-tuning them with program structures. Specifically, we represent programs as parse trees, also known as concrete syntax trees (CSTs), and adapt pre-trained models on serialized CSTs. Although the models that we adapt have been pre-trained only on the surface form of programs, we find that a small amount of continual pre-training and fine-tuning on CSTs, without changing the model architecture, yields improvements over the baseline approach across various code tasks. The improvements are particularly significant when training examples are limited, demonstrating the effectiveness of integrating program structures with plain-text representation even when working with backbone models that have not been pre-trained with structures.
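To make the idea of a serialized CST concrete, the following is a minimal sketch for a toy arithmetic grammar. Everything in it is an illustrative assumption rather than the paper's actual pipeline: the grammar, the `Node` class, the helper names (`tokenize`, `parse`, `serialize`), and the tag-style bracketing are all hypothetical, and the serialization format used in this work is not specified in the abstract. The sketch only shows how a parse tree that keeps every surface token can be flattened into a token sequence that a plain-text model could consume without architectural changes.

```python
import re
from dataclasses import dataclass, field

# Hypothetical toy grammar (not from the paper):
#   expr   -> term (('+' | '-') term)*
#   term   -> factor (('*' | '/') factor)*
#   factor -> NAME | NUMBER | '(' expr ')'

@dataclass
class Node:
    kind: str                                     # nonterminal name, e.g. "expr"
    children: list = field(default_factory=list)  # Node objects or str terminals

TOKEN_RE = re.compile(r"\s*([A-Za-z_]\w*|\d+|[-+*/()])")
ATOM_RE = re.compile(r"[A-Za-z_]\w*|\d+")

def tokenize(src):
    """Split source text into identifier, number, operator, and paren tokens."""
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    """Recursive-descent parser that keeps every token, yielding a CST."""

    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def expr(self):
        node = Node("expr", [self.term()])
        while self.peek() in ("+", "-"):
            node.children += [self.take(), self.term()]
        return node

    def term(self):
        node = Node("term", [self.factor()])
        while self.peek() in ("*", "/"):
            node.children += [self.take(), self.factor()]
        return node

    def factor(self):
        if self.peek() == "(":
            # Parentheses stay in the tree: this is what makes it concrete.
            return Node("factor", [self.take("("), self.expr(), self.take(")")])
        tok = self.take()
        if not ATOM_RE.fullmatch(tok):
            raise SyntaxError(f"unexpected token {tok!r}")
        return Node("factor", [tok])

def parse(src):
    parser = Parser(tokenize(src))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing token {parser.peek()!r}")
    return tree

def serialize(node):
    """Flatten a CST into a token list, bracketing each nonterminal with tags."""
    if isinstance(node, str):
        return [node]
    out = [f"<{node.kind}>"]
    for child in node.children:
        out.extend(serialize(child))
    out.append(f"</{node.kind}>")
    return out

print(" ".join(serialize(parse("a+b"))))
# → <expr> <term> <factor> a </factor> </term> + <term> <factor> b </factor> </term> </expr>
```

Because the tree is concrete rather than abstract, dropping the tag tokens from the serialized sequence recovers the original token stream, so the structured form stays aligned with the surface form the backbone model was pre-trained on.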