Contrastive learning has become the dominant approach to training state-of-the-art sentence embeddings. Previous studies typically learn sentence embeddings either from human-annotated natural language inference (NLI) data or from large-scale unlabeled sentences in an unsupervised manner. However, even unlabeled data can be difficult to acquire in certain domains. To address these issues, we present SynCSE, a contrastive learning framework that trains sentence embeddings on synthesized data. Specifically, we use large language models to synthesize the data samples required for contrastive learning in two settings: (1) producing positive and negative annotations for given unlabeled sentences (SynCSE-partial), and (2) generating sentences together with their corresponding annotations from scratch (SynCSE-scratch). Experimental results on sentence similarity and reranking tasks show that both SynCSE-partial and SynCSE-scratch greatly outperform unsupervised baselines, and that SynCSE-partial even achieves performance comparable to supervised models in most settings.
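To make the training signal concrete, the following is a minimal sketch of a contrastive objective over (sentence, positive, hard negative) triplets, the kind of data that SynCSE synthesizes. The abstract does not specify the loss, so the InfoNCE form with in-batch negatives, the temperature of 0.05, and the function names `l2_normalize` and `contrastive_loss` are all assumptions made for illustration. Toy vectors stand in for the outputs of a sentence encoder.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def contrastive_loss(anchors, positives, negatives, temperature=0.05):
    """InfoNCE loss over triplets (assumed form, not taken from the source).

    Each argument is an (N, d) array of sentence embeddings. For anchor i,
    positives[i] is the target; every other positive and all negatives in
    the batch act as distractors.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    n = l2_normalize(negatives)

    sim_pos = a @ p.T / temperature  # (N, N) anchor-positive similarities
    sim_neg = a @ n.T / temperature  # (N, N) anchor-negative similarities
    logits = np.concatenate([sim_pos, sim_neg], axis=1)  # (N, 2N)

    # Log-softmax over each row, shifted by the row maximum for stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The correct "class" for anchor i is column i (its own positive).
    idx = np.arange(a.shape[0])
    return float(-log_prob[idx, idx].mean())


# Toy embeddings standing in for encoder outputs (hypothetical data).
anchors = np.eye(4)
positives = np.eye(4) + 0.1   # close to the matching anchor
negatives = -np.eye(4)        # opposite to the matching anchor

print("loss with well-formed triplets:", contrastive_loss(anchors, positives, negatives))
print("loss with roles swapped:", contrastive_loss(anchors, negatives, positives))
```

Under these assumptions, the loss is small when each positive lies close to its anchor and each negative lies far from it, and large when the roles are swapped. In both SynCSE-partial and SynCSE-scratch, the positives and negatives would come from a large language model rather than from human annotators.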