Pretrained Language Models (PLMs) have emerged as the state-of-the-art paradigm for code search. The paradigm first pretrains the model on tasks unrelated to search, such as masked language modeling, and then finetunes it on the search task itself. The typical finetuning method employs a dual-encoder architecture that encodes the query and the code into separate semantic embeddings and computes their similarity from those embeddings. However, because the two inputs are encoded independently, the dual-encoder cannot model token-level interactions between query and code, which limits the model's capabilities. In this paper, we address this limitation by introducing a cross-encoder architecture for code search that jointly encodes the query and the code to model their semantic matching. We further introduce a Retriever-Ranker (RR) framework that cascades the dual-encoder and the cross-encoder to improve the efficiency of evaluation and online serving. Moreover, we present a probabilistic hard negative sampling method that improves the cross-encoder's ability to distinguish hard negative code snippets, which further enhances the cascaded RR framework. Experiments on four datasets using three code PLMs demonstrate the superiority of our proposed method.
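To make the cascade described in the abstract concrete, the following is a minimal sketch of a retrieve-then-rerank pipeline with probabilistic hard negative sampling. It is not the paper's implementation. The function names (`retrieve`, `rerank`, `sample_hard_negatives`, `toy_cross_score`), the dot-product similarity, and the softmax-over-retriever-scores sampling distribution are all assumptions made for illustration, since the abstract does not specify them. The embeddings are hand-written toy vectors standing in for dual-encoder outputs, and the token-overlap scorer stands in for a finetuned cross-encoder.

```python
import math
import random
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]
Candidate = Tuple[int, float]  # (index into the code corpus, retriever score)


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity between two embeddings (assumed metric)."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: Vector, code_vecs: List[Vector], k: int) -> List[Candidate]:
    """Stage 1 (dual-encoder): score every code embedding, keep the top k."""
    scored = [(i, dot(query_vec, v)) for i, v in enumerate(code_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def rerank(
    query: str,
    codes: List[str],
    candidates: List[Candidate],
    cross_score: Callable[[str, str], float],
) -> List[int]:
    """Stage 2 (cross-encoder): jointly score only the retrieved candidates."""
    rescored = [(i, cross_score(query, codes[i])) for i, _ in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in rescored]


def sample_hard_negatives(
    candidates: List[Candidate],
    positive_idx: int,
    n: int,
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Draw n negatives (with replacement) from the retrieved candidates.

    Assumption: sampling probability is softmax(retriever_score / temperature),
    so candidates the retriever ranks highly are drawn more often.
    """
    rng = rng or random.Random()
    pool = [(i, s) for i, s in candidates if i != positive_idx]
    if not pool:
        return []
    top = max(s for _, s in pool)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for _, s in pool]
    return rng.choices([i for i, _ in pool], weights=weights, k=n)


def toy_cross_score(query: str, code: str) -> float:
    """Hypothetical stand-in for a cross-encoder: counts query words found in the code."""
    return float(sum(word in code.lower() for word in query.lower().split()))


# Toy corpus with hand-written 2-d embeddings (placeholders for real encoder outputs).
codes = [
    "def add(a, b): return a + b",
    "def read_file(path): return open(path).read()",
    "def sort_list(xs): return sorted(xs)",
]
code_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query, query_vec = "read file", [0.1, 0.9]

candidates = retrieve(query_vec, code_vecs, k=2)
ranking = rerank(query, codes, candidates, toy_cross_score)
negatives = sample_hard_negatives(candidates, positive_idx=1, n=4, rng=random.Random(0))
```

The design point this sketch illustrates is cost: the cross-encoder runs on only `k` candidates per query rather than on the whole corpus, while the dual-encoder embeddings for the corpus can be computed once and reused across queries.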