Contrastive learning has proven effective for learning sentence representations. However, training a contrastive model requires large numbers of labeled sentences from which positive and negative pairs can be constructed explicitly, such as those in natural language inference (NLI) datasets. Acquiring sufficient high-quality labeled data is time-consuming and resource-intensive, which has led researchers to develop methods for learning sentence representations without supervision. Because randomly sampled, unstructured sentences have no clear relationship to one another, constructing reliable positive and negative pairs from them is difficult. To tackle these challenges, we propose SemCSR, a semantic-aware contrastive sentence representation framework. By leveraging the generation and evaluation capabilities of large language models (LLMs), SemCSR automatically constructs a high-quality NLI-style corpus without any human annotation and incorporates the generated sentence pairs into the training of a contrastive sentence representation model. Extensive experiments and comprehensive analyses demonstrate the effectiveness of the proposed framework for learning better sentence representations with LLMs.
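To make concrete how NLI-style triplets can feed a contrastive objective, the following is a minimal sketch of an InfoNCE loss with hard negatives, in which the entailment hypothesis serves as the positive and the contradiction hypothesis as the hard negative. The abstract does not specify SemCSR's actual loss, so this formulation is an assumption modeled on common supervised contrastive setups; the function name `info_nce_with_hard_negatives`, the temperature of 0.05, and the use of plain NumPy arrays in place of encoder outputs are all illustrative choices.

```python
import numpy as np


def info_nce_with_hard_negatives(anchor, positive, negative, temperature=0.05):
    """InfoNCE loss over a batch of (premise, entailment, contradiction) embeddings.

    All three inputs have shape (batch, dim). This is an illustrative
    formulation, not necessarily the objective used by SemCSR.
    """
    def normalize(x):
        # Unit-normalize rows so that dot products are cosine similarities.
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    a, p, n = normalize(anchor), normalize(positive), normalize(negative)

    # Similarity of every anchor to every positive and every hard negative
    # in the batch: shape (batch, 2 * batch).
    logits = np.concatenate([a @ p.T, a @ n.T], axis=1) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # For row i, the correct "class" is column i, its own positive.
    idx = np.arange(a.shape[0])
    return float(-log_probs[idx, idx].mean())
```

Under this sketch, each premise is pulled toward its own entailment hypothesis and pushed away from every other hypothesis in the batch, with its contradiction hypothesis acting as an especially informative negative.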