Contrastive learning has been widely studied in sentence representation learning. However, earlier work has focused mainly on constructing positive examples, while in-batch samples are often simply treated as negative examples. This approach overlooks the importance of selecting appropriate negative examples, which can lead to a scarcity of hard negatives and the inclusion of false negatives. To address these issues, we propose ClusterNS (Clustering-aware Negative Sampling), a novel method that incorporates cluster information into contrastive learning for unsupervised sentence representation learning. We apply a modified K-means clustering algorithm to supply hard negatives and recognize in-batch false negatives during training, addressing both issues within one unified framework. Experiments on semantic textual similarity (STS) tasks demonstrate that our proposed ClusterNS compares favorably with baselines in unsupervised sentence representation learning. Our code has been made publicly available.
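The idea of using cluster information to supply hard negatives and to recognize in-batch false negatives can be illustrated with a minimal sketch. This is not the ClusterNS implementation: the abstract does not specify the selection rules, so the sketch assumes that a sample's hard negative is its second-nearest centroid and that a false negative is any other in-batch sample sharing the same nearest centroid. The function names (`cluster_aware_negatives`, `info_nce`), the temperature value, and the toy data are all hypothetical, and the centroids are taken as given rather than updated by the modified K-means procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def cluster_aware_negatives(embeddings, centroids):
    """Pick a hard negative per sample and flag in-batch false negatives.

    Assumptions (not stated in the source): the hard negative is the
    second-nearest centroid, and a false negative is any other in-batch
    sample whose nearest centroid matches the anchor's.
    """
    # Rank centroid indices for each sample, most similar first.
    rankings = [
        sorted(range(len(centroids)),
               key=lambda k: cosine(e, centroids[k]),
               reverse=True)
        for e in embeddings
    ]
    own = [r[0] for r in rankings]                        # nearest cluster
    hard_negatives = [centroids[r[1]] for r in rankings]  # second nearest
    n = len(embeddings)
    false_negative_mask = [
        [i != j and own[i] == own[j] for j in range(n)] for i in range(n)
    ]
    return own, hard_negatives, false_negative_mask

def info_nce(anchors, positives, hard_negatives, mask, temperature=0.05):
    """Batch-averaged InfoNCE loss with one extra hard negative per anchor.

    In-batch samples flagged in `mask` are left out of the denominator
    instead of being treated as negatives.
    """
    losses = []
    for i, a in enumerate(anchors):
        pos = math.exp(cosine(a, positives[i]) / temperature)
        denom = pos + math.exp(cosine(a, hard_negatives[i]) / temperature)
        for j, p in enumerate(positives):
            if j != i and not mask[i][j]:
                denom += math.exp(cosine(a, p) / temperature)
        losses.append(-math.log(pos / denom))
    return sum(losses) / len(losses)

# Toy batch: the first two vectors are near-duplicates, the third differs.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
centroids = [[1.0, 0.1], [0.1, 1.0]]
own, hard, mask = cluster_aware_negatives(embeddings, centroids)
# The embeddings stand in for their own positive views in this toy run.
loss = info_nce(embeddings, embeddings, hard, mask)
```

In this toy batch the first two samples fall into the same cluster, so each is masked out of the other's denominator rather than pushed away as a negative, while the third sample remains an ordinary in-batch negative for both. In a real training loop the positives would be separately encoded views of each sentence; the sketch reuses the embeddings only to stay self-contained.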