Several prior studies have suggested that word frequency bias can cause the BERT model to learn indistinguishable sentence embeddings. Contrastive learning schemes such as SimCSE and ConSERT have been adopted successfully in unsupervised sentence embedding to improve embedding quality by reducing this bias. However, these methods introduce new biases, such as sentence length bias and false negative sample bias, which hinder the model's ability to learn more fine-grained semantics. In this paper, we reexamine the challenges of contrastive sentence embedding learning from a debiasing perspective and argue that effectively eliminating the influence of various biases is crucial for learning high-quality sentence embeddings. We contend that all of these biases are introduced by the simple rules used to construct training data in contrastive learning, and that the key to contrastive sentence embedding learning is to mimic, in an unsupervised way, the distribution of training data in supervised machine learning. We propose a novel contrastive framework for sentence embedding, termed DebCSE, which can eliminate the impact of these biases by using an inverse propensity weighted sampling method to select high-quality positive and negative pairs according to both the surface and semantic similarity between sentences. Extensive experiments on semantic textual similarity (STS) benchmarks show that DebCSE significantly outperforms the latest state-of-the-art models, with an average Spearman's correlation coefficient of 80.33% on BERTbase.