CoCoSoDa: Effective Contrastive Learning for Code Search

Code search aims to retrieve semantically relevant code snippets for a given natural language query. Recently, many approaches employing contrastive learning have shown promising results on code representation learning and have greatly improved the performance of code search. However, there is still considerable room for improvement in how contrastive learning is used for code search. In this paper, we propose CoCoSoDa, which makes effective use of contrastive learning for code search by addressing two of its key factors: data augmentation and negative samples. Specifically, soft data augmentation generates positive samples by dynamically masking some tokens of an input sequence or replacing them with their types. A momentum mechanism, which maintains a queue and a momentum encoder, provides a large and consistent set of negative-sample representations for each mini-batch. In addition, multimodal contrastive learning pulls together the representations of paired code snippets and queries and pushes apart those of unpaired code snippets and queries. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset covering six programming languages. Experimental results show that: (1) CoCoSoDa outperforms 14 baselines and, in particular, exceeds CodeBERT, GraphCodeBERT, and UniXcoder by 13.3%, 10.5%, and 5.9% on average MRR scores, respectively. (2) The ablation studies show the effectiveness of each component of our approach. (3) When we adapt our techniques to several different pre-trained models, such as RoBERTa, CodeBERT, and GraphCodeBERT, we observe a significant boost in their code search performance. (4) Our model performs robustly under different hyper-parameters. Furthermore, we perform qualitative and quantitative analyses to explore the reasons behind the good performance of our model.
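
The three ingredients named in the abstract (soft data augmentation, a momentum-updated encoder with a queue of negatives, and a contrastive loss over code-query pairs) can be illustrated with a minimal, dependency-free sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the function and class names, the masking probability `p`, the even split between masking and type replacement, the momentum coefficient `m`, the temperature, and the use of an InfoNCE-style loss are all hypothetical choices that the abstract does not specify.

```python
import math
import random
from collections import deque


def soft_augment(tokens, types, p=0.15, rng=random):
    """Return a positive sample: each token is, with probability p (assumed value),
    either masked or replaced by its type tag; otherwise it is kept unchanged."""
    out = []
    for tok, typ in zip(tokens, types):
        if rng.random() < p:
            # Assumed 50/50 split between masking and type replacement.
            out.append("<mask>" if rng.random() < 0.5 else f"<{typ}>")
        else:
            out.append(tok)
    return out


def momentum_update(key_params, query_params, m=0.999):
    """Move the momentum (key) encoder slowly toward the trained encoder:
    theta_k <- m * theta_k + (1 - m) * theta_q."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class NegativeQueue:
    """Fixed-size FIFO of past embeddings that serve as extra negatives."""

    def __init__(self, max_size):
        self._items = deque(maxlen=max_size)  # oldest entries are evicted first

    def enqueue(self, embeddings):
        self._items.extend(embeddings)

    def negatives(self):
        return list(self._items)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v)) or 1.0  # avoid division by zero
    return [x / norm for x in v]


def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style loss: low when the anchor is closer to its positive
    than to any negative (cosine similarity, numerically stable log-sum-exp)."""
    a = _normalize(anchor)
    logits = [_dot(a, _normalize(positive))]
    logits += [_dot(a, _normalize(n)) for n in negatives]
    logits = [l / temperature for l in logits]
    mx = max(logits)
    log_denom = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_denom - logits[0]


def bimodal_loss(query_vec, code_vec, code_negs, query_negs, temperature=0.07):
    """Pull a paired query and code snippet together in both directions
    while pushing each away from unpaired items of the other modality."""
    q2c = info_nce(query_vec, code_vec, code_negs, temperature)
    c2q = info_nce(code_vec, query_vec, query_negs, temperature)
    return q2c + c2q
```

In a training loop built along these lines, each batch would be augmented with `soft_augment`, encoded by both encoders, scored with `bimodal_loss` against the contents of a `NegativeQueue`, and followed by a `momentum_update` and an `enqueue` of the new momentum-encoder embeddings. Because the queue is decoupled from the batch, the number of negatives can be much larger than the mini-batch size, and the slow momentum update keeps the queued embeddings consistent with one another.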
