Recently, large language models (LLMs) have emerged as a groundbreaking technology, and their unparalleled text generation capabilities have sparked interest in applying them to the fundamental task of sentence representation learning. Existing methods have explored using LLMs as data annotators to synthesize training data for contrastive-learning-based sentence embedding models such as SimCSE. However, since contrastive learning models are sensitive to the quality of sentence pairs, the effectiveness of these methods is largely determined by the content generated by the LLM, highlighting the need for more refined generation in the context of sentence representation learning. Building on this premise, we propose MultiCSR, a multi-level contrastive sentence representation learning framework. MultiCSR decomposes the process of prompting LLMs to generate a training corpus for base sentence embedding models into three stages (i.e., sentence generation, sentence pair construction, and in-batch training) and refines the generated content at each stage, ensuring that only high-quality sentence pairs are used to train the base contrastive learning model. Our extensive experiments reveal that MultiCSR enables a less advanced LLM to surpass the performance of ChatGPT, while applying it to ChatGPT yields new state-of-the-art results. Comprehensive analyses further underscore the potential of our framework across various application scenarios and for achieving better sentence representation learning with LLMs.
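The three-stage decomposition above can be illustrated with a minimal sketch. This is not the paper's actual implementation: the generated sentences, the `quality_score` and `overlap` heuristics, and the thresholds are all toy stand-ins for the LLM-based refinement MultiCSR would perform; only the InfoNCE-style objective in the last stage reflects standard contrastive training.

```python
# Hypothetical sketch of a three-stage refinement pipeline (generation,
# pair construction, in-batch training). All scoring functions and data
# are illustrative placeholders, not the method described in the paper.
import math

# Stage 1: sentence generation. A fixed toy list stands in for
# LLM-generated sentences; one entry is deliberately low quality.
generated = [
    "A man is playing a guitar.",
    "A man plays the guitar.",
    "The weather is sunny today.",
    "asdf qwer zxcv",  # malformed generation to be filtered out
]

def quality_score(sentence: str) -> float:
    """Toy fluency proxy: well-formed sentences start with a capital
    letter and end with a period. A real system would use an LLM judge."""
    return 1.0 if sentence[:1].isupper() and sentence.endswith(".") else 0.0

# Stage 1 refinement: keep only sentences that pass the quality check.
sentences = [s for s in generated if quality_score(s) > 0.5]

def overlap(a: str, b: str) -> float:
    """Toy semantic-similarity proxy: Jaccard overlap of lowercased tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

# Stage 2 refinement: construct candidate positive pairs and keep only
# those whose similarity exceeds a threshold.
pairs = [(a, b) for i, a in enumerate(sentences)
         for b in sentences[i + 1:] if overlap(a, b) > 0.3]

def info_nce(sim_matrix, temp=0.05):
    """Stage 3: standard in-batch InfoNCE loss. Rows are anchors,
    columns are candidates; the diagonal holds the positives."""
    loss = 0.0
    for i, row in enumerate(sim_matrix):
        logits = [s / temp for s in row]
        m = max(logits)  # subtract max for numerical stability
        loss += -(logits[i] - m) + math.log(sum(math.exp(z - m) for z in logits))
    return loss / len(sim_matrix)
```

Here only the paraphrase pair ("A man is playing a guitar.", "A man plays the guitar.") survives stage 2, while the unrelated weather sentence and the malformed generation are filtered before reaching the contrastive objective.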