How to get better embeddings with code pre-trained models? An empirical study

Pre-trained language models have demonstrated powerful capabilities in natural language processing (NLP). Recently, code pre-trained models (PTMs), which draw on the experience of the NLP field, have also achieved state-of-the-art results on many software engineering (SE) downstream tasks. These code PTMs take the differences between programming languages and natural languages into account during pre-training and adjust their pre-training tasks and input data accordingly. However, researchers in the SE community still inherit habits from the NLP field when using code PTMs to generate embeddings for SE downstream classification tasks: they take the embedding of a special token as the semantic embedding of a whole code snippet, and they feed code and text to the model in the same way as during pre-training. In this paper, we empirically study five PTMs covering three architectures (encoder-only: CodeBERT; decoder-only: CodeGPT and CodeGen; encoder-decoder: CodeT5 and PLBART) on four SE downstream classification tasks (code vulnerability detection, code clone detection, just-in-time defect prediction, and function-docstring mismatch detection) with respect to these two practices. Our experimental results indicate that (1) regardless of the architecture of the code PTM, embeddings obtained through special tokens do not sufficiently aggregate the semantic information of the entire code snippet; (2) code embeddings obtained by combining code data and text data in the same way as during pre-training are of poor quality, and this input format does not guarantee richer semantic information; (3) when the vector representations of all code tokens are aggregated, decoder-only PTMs can obtain code embeddings whose semantic quality matches or even exceeds that of embeddings from encoder-only and encoder-decoder PTMs.
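To make the contrast between the two embedding strategies concrete, the following is a minimal sketch rather than the procedure used in the study. It assumes mean pooling as the aggregation over all code tokens (the abstract does not say which aggregation is used), and it stands in a small hand-written matrix for the model's per-token hidden states. The function names `special_token_embedding` and `mean_pooled_embedding` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def special_token_embedding(token_vectors: Sequence[Vector], index: int = 0) -> Vector:
    """Use the hidden state of one special token as the snippet embedding.

    index=0 mimics a leading classification token; a decoder-only model
    would more typically use the last non-padding position instead.
    """
    return list(token_vectors[index])


def mean_pooled_embedding(
    token_vectors: Sequence[Vector], attention_mask: Sequence[int]
) -> Vector:
    """Average the hidden states of all non-padding tokens (assumed aggregation)."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have equal length")
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy per-token hidden states (4 positions, hidden size 2); values are made up.
hidden_states = [
    [1.0, 0.0],  # position 0: special token
    [0.0, 2.0],  # code token
    [2.0, 4.0],  # code token
    [9.0, 9.0],  # padding, excluded by the mask
]
mask = [1, 1, 1, 0]

cls_vec = special_token_embedding(hidden_states)
mean_vec = mean_pooled_embedding(hidden_states, mask)
print(cls_vec)   # [1.0, 0.0]
print(mean_vec)  # [1.0, 2.0]
```

In this sketch the pooled vector also includes the special-token position; whether to exclude it, and whether to use mean, max, or another aggregation, are design choices the abstract leaves open.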
