Identical and Fraternal Twins: Fine-Grained Semantic Contrastive Learning of Sentence Representations

Contrastive learning has substantially advanced the unsupervised learning of sentence representations. It pulls an augmented positive instance toward its anchor instance to shape the desired embedding space. However, relying solely on the contrastive objective can yield sub-optimal results because it cannot distinguish subtle semantic variations between positive pairs. Specifically, common data augmentation techniques frequently introduce semantic distortion, which creates a semantic margin between the two members of a positive pair. The InfoNCE loss function overlooks this margin and simply maximizes the similarity between positive pairs during training, which leaves the trained model insensitive to fine-grained semantic differences. In this paper, we introduce a novel Identical and Fraternal Twins of Contrastive Learning (IFTCL) framework that can simultaneously adapt to the various positive pairs generated by different augmentation techniques. To overcome the sub-optimality issue, we propose a Twins Loss that preserves the innate margin during training and better exploits the potential of data augmentation. We also present proof-of-concept experiments, combined with the contrastive objective, to demonstrate the validity of the proposed Twins Loss. Furthermore, we propose a hippocampus queue mechanism to store and reuse negative instances without additional computation, which further improves the efficiency and performance of IFTCL. We verify the IFTCL framework on nine semantic textual similarity tasks with both English and Chinese datasets, and the experimental results show that IFTCL outperforms state-of-the-art methods.
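To make the motivation concrete, the following is a minimal NumPy sketch of the standard InfoNCE objective, paired with a generic first-in-first-out buffer of negative embeddings. It is an illustration under stated assumptions, not the Twins Loss or the hippocampus queue proposed here, whose definitions the abstract does not give. The names `info_nce` and `NegativeQueue`, the temperature of 0.05, the embedding size, and the random toy data are all hypothetical choices made for this example.

```python
from collections import deque

import numpy as np


def info_nce(anchor, positive, negatives, temperature=0.05):
    """Standard InfoNCE loss for one anchor (hypothetical helper).

    anchor, positive: 1-D embedding vectors; negatives: 2-D array (K, d).
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = normalize(anchor), normalize(positive), normalize(negatives)
    # Cosine similarities: the positive first, then the K negatives.
    logits = np.concatenate(([a @ p], n @ a)) / temperature
    logits = logits - logits.max()  # shift for numerical stability
    # Negative log-probability of picking the positive among all candidates.
    return -logits[0] + np.log(np.exp(logits).sum())


class NegativeQueue:
    """Generic FIFO buffer of past embeddings (an assumption, not the
    paper's hippocampus queue)."""

    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)  # oldest entries drop out first

    def enqueue(self, embeddings):
        for e in embeddings:
            self.buffer.append(np.array(e, copy=True))

    def as_array(self):
        return np.stack(list(self.buffer))


rng = np.random.default_rng(0)
queue = NegativeQueue(max_size=16)
queue.enqueue(rng.normal(size=(16, 8)))  # embeddings from earlier batches
queue.enqueue(rng.normal(size=(4, 8)))   # newer ones evict the oldest four

anchor = rng.normal(size=8)
identical = anchor.copy()                      # undistorted positive
fraternal = anchor + 0.5 * rng.normal(size=8)  # semantically distorted positive

loss_identical = info_nce(anchor, identical, queue.as_array())
loss_fraternal = info_nce(anchor, fraternal, queue.as_array())
```

For a fixed set of negatives, the loss decreases monotonically as the positive's cosine similarity to the anchor increases, so `loss_identical` is always lower than `loss_fraternal`. In other words, the objective rewards collapsing every positive onto its anchor regardless of how much distortion the augmentation introduced, which is the insensitivity to the semantic margin described above. Reusing queued embeddings as negatives avoids re-encoding them, at the cost of those embeddings being slightly stale.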
