Your Negative May not Be True Negative: Boosting Image-Text Matching with False Negative Elimination

Most existing image-text matching methods adopt triplet loss as the optimization objective, and choosing a proper negative sample for the <anchor, positive, negative> triplet is important for training the model effectively; for example, hard negatives make learning more efficient. However, we observe that existing methods mainly take the most similar samples as hard negatives, which may not be true negatives. In other words, samples that are highly similar to the anchor but not paired with it may still retain positive semantic associations; we call them false negatives. Repelling these false negatives in the triplet loss misleads semantic representation learning and results in inferior retrieval performance. In this paper, we propose a novel False Negative Elimination (FNE) strategy that selects negatives via sampling, which alleviates the problem introduced by false negatives. Specifically, we first construct the distributions of positive and negative samples separately from their similarities with the anchor, based on the features extracted by the image and text encoders. Then we calculate the false negative probability of a given sample from its similarity with the anchor and the above distributions via Bayes' rule, and this probability is employed as the sampling weight during the negative sampling process. Since a small batch may not contain any false negatives, we design a memory module with momentum to retain a large negative buffer and apply our negative sampling strategy over the buffer. In addition, to make the model focus on hard negatives, we reassign the sampling weights of simple negatives with a cut-down strategy. Extensive experiments on Flickr30K and MS-COCO demonstrate the superiority of the proposed false negative elimination strategy. The code is available at https://github.com/LuminosityX/FNE.
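
The Bayes' rule step can be illustrated with a minimal sketch. It is not the released FNE code, and everything not stated in the abstract is an assumption: the positive and negative similarity distributions are modeled as Gaussians, the prior `prior_pos` is a hypothetical value, all function names are made up for illustration, and the sampling weight is taken as the complement of the false negative probability so that likely false negatives are drawn less often. The memory module and the cut-down strategy are omitted.

```python
import math
import random
from statistics import mean, stdev


def gaussian_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def false_negative_probability(sim, pos_sims, neg_sims, prior_pos=0.01):
    """Posterior P(positive | sim) for an unpaired sample, via Bayes' rule.

    Assumptions (not from the abstract): both similarity distributions are
    Gaussians fitted to the observed similarities, and prior_pos is a
    hypothetical prior for an unpaired sample being semantically positive.
    """
    joint_pos = gaussian_pdf(sim, mean(pos_sims), stdev(pos_sims)) * prior_pos
    joint_neg = gaussian_pdf(sim, mean(neg_sims), stdev(neg_sims)) * (1.0 - prior_pos)
    total = joint_pos + joint_neg
    # Guard against both densities underflowing to zero.
    return joint_pos / total if total > 0.0 else 0.0


def sampling_weights(candidate_sims, pos_sims, neg_sims, prior_pos=0.01):
    """Assumed weighting: 1 - P(false negative) for each candidate."""
    return [
        1.0 - false_negative_probability(s, pos_sims, neg_sims, prior_pos)
        for s in candidate_sims
    ]


def sample_negative(weights, rng):
    """Draw the index of one negative in proportion to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def triplet_loss(sim_pos, sim_neg, margin=0.2):
    """Standard hinge triplet loss on similarities; margin is illustrative."""
    return max(0.0, margin + sim_neg - sim_pos)
```

With anchor-positive similarities clustered near 0.8 and anchor-negative similarities near 0.2, an unpaired candidate whose similarity is 0.85 receives a posterior close to 1 and therefore a sampling weight close to 0, while a candidate at 0.2 keeps a weight close to 1.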
