Rethinking Negative Pairs in Code Search

Contrastive learning has recently become a key component in fine-tuning code search models, with the goal of improving the efficiency and effectiveness of software development. Given a search query, it pulls positive code snippets closer while pushing negative samples away. Among contrastive losses, InfoNCE is the most widely used because of its strong performance. However, two problems with how InfoNCE treats negative samples may degrade its representation learning: 1) large code corpora contain false negatives because of duplicated code; 2) InfoNCE does not explicitly differentiate between negative samples by their potential relevance. For example, for a query about the quick sort algorithm, a bubble sort implementation is less "negative" than a file-saving function. In this paper, we tackle these problems by proposing a simple yet effective Soft-InfoNCE loss that inserts weight terms into InfoNCE. Within the proposed loss, we apply three methods to estimate the weights of negative pairs and show that the vanilla InfoNCE loss is a special case of Soft-InfoNCE. Theoretically, we analyze the effects of Soft-InfoNCE on controlling the distribution of learnt code representations and on deriving a more precise mutual information estimate. We further discuss the advantages of the proposed loss function over alternative designs. Extensive experiments demonstrate the effectiveness of Soft-InfoNCE and the weight estimation methods with state-of-the-art code search models on a large-scale public dataset covering six programming languages. Source code is available at https://github.com/Alex-HaochenLi/Soft-InfoNCE.
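To make the idea of inserting weight terms into InfoNCE concrete, here is a minimal pure-Python sketch for a single query. It is an illustration under stated assumptions, not the authors' implementation: the function name `soft_info_nce`, the similarity scores, and the weight values are all hypothetical, and the abstract does not specify how the three weight estimation methods work. The sketch only assumes that each negative's exponentiated similarity is multiplied by a weight in the denominator, so that setting every weight to 1 gives back vanilla InfoNCE.

```python
import math


def soft_info_nce(sims, pos_index, weights):
    """Weighted InfoNCE loss for one query (illustrative sketch only).

    sims      -- similarity scores between the query and each candidate snippet
    pos_index -- index of the positive snippet in `sims`
    weights   -- one weight per candidate; the entry at `pos_index` is ignored
    """
    # Subtract the maximum score before exponentiating for numerical stability.
    shift = max(sims)
    pos_term = math.exp(sims[pos_index] - shift)
    # Each negative contributes its exponentiated score scaled by its weight.
    neg_term = sum(
        w * math.exp(s - shift)
        for k, (s, w) in enumerate(zip(sims, weights))
        if k != pos_index
    )
    return math.log(pos_term + neg_term) - math.log(pos_term)


# Hypothetical scores for the query "quick sort":
# index 0 = quick sort (positive), 1 = bubble sort, 2 = file-saving function.
sims = [2.0, 1.5, -0.5]

vanilla = soft_info_nce(sims, 0, [1.0, 1.0, 1.0])    # all weights 1: plain InfoNCE
softened = soft_info_nce(sims, 0, [1.0, 0.5, 1.0])   # bubble sort pushed away less
masked = soft_info_nce(sims, 0, [1.0, 0.0, 1.0])     # treated as a false negative
```

In this sketch, a smaller weight shrinks a negative's share of the denominator, so that snippet is pushed away from the query less strongly, and a weight of 0 removes a suspected false negative from the loss entirely.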


