Learning code representations is a core prerequisite for many software engineering tasks such as code clone detection and code generation. State-of-the-art program representation techniques mainly rely on pre-trained language models (PLMs) such as CodeBERT. A Transformer encoder is first pre-trained on a large-scale code corpus to acquire general knowledge about source code; the pre-trained model is then fine-tuned on specific tasks using labeled data. However, gathering training samples for downstream tasks can be prohibitively expensive and impractical for domain-specific languages or project-specific tasks. Moreover, pre-training and downstream tasks are usually heterogeneous, which makes it difficult to fully exploit the knowledge learned during pre-training. In this paper, we propose Zecoler, a zero-shot approach for learning code representations. Zecoler is built upon a pre-trained programming language model. To elicit knowledge from the PLM efficiently, Zecoler casts downstream tasks into the same form as the pre-training objectives by inserting trainable prompts into the original input. These prompts guide the PLM to generate better results. We then employ the prompt tuning technique to automatically search for the optimal prompts. This enables the representation model to efficiently fit downstream tasks by fine-tuning on a dataset in the source language domain, and then to reuse the pre-trained knowledge in the target domain in a zero-shot manner. We evaluate Zecoler on five code intelligence tasks: code clone detection, code search, method name prediction, code summarization, and code generation. The results show that our approach significantly outperforms baseline models under the zero-shot setting.
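To make the core mechanism concrete, the following is a minimal sketch of continuous prompt tuning over a frozen code PLM, in the spirit of the approach described above. It is not the paper's implementation: it assumes the HuggingFace `transformers` library, the public `microsoft/codebert-base` checkpoint, a binary classification head, and an arbitrary choice of 8 prompt vectors, all of which are illustrative assumptions. Only the prompt embeddings and the head are trainable; the PLM stays frozen.

```python
# Hypothetical sketch of prompt tuning with a frozen PLM (not the paper's code).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class PromptTunedClassifier(nn.Module):
    # n_prompts and n_classes are illustrative assumptions, not values from the paper.
    def __init__(self, plm_name="microsoft/codebert-base", n_prompts=8, n_classes=2):
        super().__init__()
        self.plm = AutoModel.from_pretrained(plm_name)
        for p in self.plm.parameters():  # freeze the pre-trained model
            p.requires_grad = False
        hidden = self.plm.config.hidden_size
        # Trainable continuous prompt vectors, prepended to the token embeddings.
        self.prompts = nn.Parameter(torch.randn(n_prompts, hidden) * 0.02)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, input_ids, attention_mask):
        batch = input_ids.size(0)
        # Look up word embeddings, then splice the prompt vectors in front.
        tok_emb = self.plm.embeddings.word_embeddings(input_ids)
        prompt_emb = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        inputs_embeds = torch.cat([prompt_emb, tok_emb], dim=1)
        # Extend the attention mask to cover the prompt positions.
        prompt_mask = torch.ones(batch, self.prompts.size(0),
                                 device=attention_mask.device,
                                 dtype=attention_mask.dtype)
        mask = torch.cat([prompt_mask, attention_mask], dim=1)
        out = self.plm(inputs_embeds=inputs_embeds, attention_mask=mask)
        # Use the first position (a prompt slot) as the sequence representation.
        return self.head(out.last_hidden_state[:, 0])

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
enc = tokenizer("def add(a, b): return a + b", return_tensors="pt")
logits = PromptTunedClassifier()(enc["input_ids"], enc["attention_mask"])
```

Because gradients flow only into `self.prompts` and `self.head`, tuning searches the prompt space rather than rewriting the PLM's weights, which is what lets the pre-trained knowledge transfer to a target domain in a zero-shot style.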