Narrowing the Gap between Supervised and Unsupervised Sentence Representation Learning with Large Language Model

Sentence Representation Learning (SRL) is a fundamental task in Natural Language Processing (NLP), with Contrastive learning of Sentence Embeddings (CSE) as the mainstream technique due to its superior performance. An intriguing phenomenon in CSE is the significant performance gap between supervised and unsupervised methods, even when their sentence encoder and loss function are the same. Previous works attribute this performance gap to differences in two representation properties (alignment and uniformity). However, alignment and uniformity only measure the results, which means they cannot answer "What happens during the training process that leads to the performance gap?" and "How can the performance gap be narrowed?". In this paper, we conduct empirical experiments to answer these "What" and "How" questions. We first answer the "What" question by thoroughly comparing the behavior of supervised and unsupervised CSE during their respective training processes. From the comparison, we observe a significant difference in fitting difficulty. Thus, we introduce a metric, called Fitting Difficulty Increment (FDI), to measure the fitting difficulty gap between the evaluation dataset and the held-out training dataset, and use the metric to answer the "What" question. Then, based on the insights gained from the "What" question, we tackle the "How" question by increasing the fitting difficulty of the training dataset. We achieve this by leveraging the In-Context Learning (ICL) capability of Large Language Models (LLMs) to generate data that simulates complex patterns. By utilizing the hierarchical patterns in the LLM-generated data, we effectively narrow the gap between supervised and unsupervised CSE.
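
For readers unfamiliar with the two representation properties mentioned above, the following is a minimal sketch of how alignment and uniformity are commonly computed on L2-normalized embeddings, following the widely used definitions of Wang and Isola (2020). The abstract itself gives no formulas, so the function names, the NumPy implementation, and the default parameters (alpha=2, t=2) are assumptions made for illustration. The sketch does not cover FDI, whose definition is not stated in the abstract.

```python
import numpy as np


def l2_normalize(x):
    # Project each embedding (one per row) onto the unit hypersphere.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def alignment(x, y, alpha=2):
    # Mean distance (raised to alpha) between embeddings of positive pairs.
    # Row i of x and row i of y are assumed to form a positive pair.
    # Lower values mean positive pairs sit closer together.
    x, y = l2_normalize(x), l2_normalize(y)
    return float(np.mean(np.linalg.norm(x - y, axis=1) ** alpha))


def uniformity(x, t=2):
    # Log of the mean Gaussian potential over all distinct pairs.
    # Lower values mean embeddings are spread more evenly on the sphere.
    x = l2_normalize(x)
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(x), k=1)  # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))
```

Both quantities are computed from a trained encoder's outputs, which is why, as the abstract notes, they describe the result of training rather than what happens during it.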
