Contrastive learning has recently become a key component in fine-tuning code search models, which support efficient and effective software development. Given a search query, it pulls positive code snippets closer while pushing negative samples away. Among contrastive losses, InfoNCE is the most widely used because of its strong performance. However, two problems with the negative samples in InfoNCE may degrade its representation learning: 1) large code corpora contain false negative samples due to duplication; 2) the loss does not explicitly differentiate between negative samples according to their potential relevance. For example, given a quicksort query, a bubble sort implementation is less ``negative'' than a file-saving function. In this paper, we tackle these problems by proposing a simple yet effective Soft-InfoNCE loss that inserts weight terms into InfoNCE. We apply three methods to estimate the weights of negative pairs and show that the vanilla InfoNCE loss is a special case of Soft-InfoNCE. Theoretically, we analyze the effects of Soft-InfoNCE on controlling the distribution of learnt code representations and on deriving a more precise mutual information estimation. We furthermore discuss the superiority of the proposed loss function over other design alternatives. Extensive experiments demonstrate the effectiveness of Soft-InfoNCE and the weight estimation methods with state-of-the-art code search models on a large-scale public dataset covering six programming languages. Source code is available at \url{https://github.com/Alex-HaochenLi/Soft-InfoNCE}.
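The idea of inserting weight terms into InfoNCE can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the code below assumes one plausible form: each negative term in the InfoNCE denominator is multiplied by a non-negative weight. The function name `soft_info_nce`, its parameters, and the example scores and weights are all hypothetical and are not taken from the authors' repository. With every weight set to 1, the sketch reduces to the vanilla InfoNCE loss, which mirrors the special-case relationship stated in the abstract.

```python
import math


def soft_info_nce(pos_score, neg_scores, neg_weights=None, temperature=1.0):
    """Weighted InfoNCE loss for a single query (illustrative sketch only).

    pos_score   : similarity between the query and its positive code snippet
    neg_scores  : similarities between the query and each negative snippet
    neg_weights : assumed per-negative weights; None means all weights are 1,
                  which gives the vanilla InfoNCE loss
    """
    if neg_weights is None:
        neg_weights = [1.0] * len(neg_scores)
    if len(neg_weights) != len(neg_scores):
        raise ValueError("neg_weights and neg_scores must have the same length")

    scaled_pos = pos_score / temperature
    scaled_negs = [s / temperature for s in neg_scores]

    # Subtract the maximum logit before exponentiating for numerical stability.
    shift = max([scaled_pos] + scaled_negs)
    pos_term = math.exp(scaled_pos - shift)
    neg_term = sum(
        w * math.exp(s - shift) for w, s in zip(neg_weights, scaled_negs)
    )
    return -math.log(pos_term / (pos_term + neg_term))


# Hypothetical similarity scores for a quicksort query.
positive = 0.9                 # a quicksort implementation
negatives = [0.7, 0.1]         # bubble sort, file-saving function

vanilla_loss = soft_info_nce(positive, negatives)
# Assumed weights: the bubble sort negative is treated as less "negative".
soft_loss = soft_info_nce(positive, negatives, neg_weights=[0.5, 1.0])
```

In this sketch, lowering the weight of the bubble sort negative shrinks its contribution to the denominator, so it is pushed away from the query less strongly than the file-saving function. How the weights are actually estimated is the subject of the three methods mentioned in the abstract and is not reproduced here.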