Researchers have explored pre-trained language models such as CodeBERT as a way to improve source code-related tasks. Previous studies have relied mainly on CodeBERT's text embeddings and the `[CLS]` sentence embedding as semantic representations when fine-tuning on downstream source code-related tasks. However, these methods need additional neural network layers to extract effective features, which raises computational cost. Furthermore, existing approaches do not exploit the rich knowledge contained in both the source code and its related text, which can lower accuracy. This paper presents CodePrompt, a novel approach that uses prompt learning and an attention mechanism to recall rich knowledge from a pre-trained model and improve source code-related classification tasks. Our approach first prompts the language model to retrieve knowledge associated with the input and uses it as representative features, which avoids additional neural network layers and reduces computational cost. We then employ an attention mechanism to aggregate multiple layers of related knowledge for each task into the final features, which boosts accuracy. We conducted extensive experiments on four downstream source code-related tasks, and the results demonstrate that CodePrompt achieves new state-of-the-art accuracy while also reducing computational cost.
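The layer-aggregation step described in the abstract can be illustrated with a small sketch. The abstract does not specify the exact form of the attention mechanism, so what follows is only one plausible reading: scaled dot-product attention pooling over one feature vector per model layer. The function names (`softmax`, `aggregate_layers`), the fixed `query` vector, and the toy numbers are all assumptions made for illustration and are not taken from CodePrompt itself.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(
    layer_features: Sequence[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Attention-pool one feature vector per layer into a single vector.

    layer_features: one vector per model layer (hypothetical: the hidden
        state recalled by the prompt in each layer).
    query: a task-specific vector (learned in practice; fixed here).
    Returns (pooled_vector, attention_weights).
    """
    dim = len(query)
    # Scaled dot-product score of each layer's vector against the query.
    scores = [
        sum(q * h for q, h in zip(query, features)) / math.sqrt(dim)
        for features in layer_features
    ]
    weights = softmax(scores)
    # Weighted sum of the layer vectors, one output value per dimension.
    pooled = [
        sum(w * features[i] for w, features in zip(weights, layer_features))
        for i in range(dim)
    ]
    return pooled, weights


# Toy input: 3 "layers" with 4-dimensional features (made-up numbers).
layers = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.1, 0.0, 0.2],
    [0.9, 0.8, 0.7, 0.6],
]
query = [1.0, 0.0, 1.0, 0.0]
pooled, weights = aggregate_layers(layers, query)
print("attention weights:", [round(w, 3) for w in weights])
print("pooled feature length:", len(pooled))
```

In this sketch the pooled vector would feed a classifier directly. In a real setup the query would be a trained parameter and the per-layer vectors would come from the pre-trained model rather than from hand-written lists.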