Contrastive learning-based methods, such as unsup-SimCSE, have achieved state-of-the-art (SOTA) performance in learning unsupervised sentence embeddings. However, in previous studies, each embedding used for contrastive learning is derived from only one sentence instance; we call these instance-level embeddings. In other words, each embedding is treated as a unique class of its own, which may hurt generalization performance. In this study, we propose IS-CSE (instance smoothing contrastive sentence embedding) to smooth the boundaries of embeddings in the feature space. Specifically, we retrieve embeddings from a dynamic memory buffer according to their semantic similarity to form a positive embedding group. The embeddings in the group are then aggregated by a self-attention operation to produce a smoothed instance embedding for further analysis. We evaluate our method on standard semantic textual similarity (STS) tasks and achieve average Spearman's correlations of 78.30%, 79.47%, 77.73%, and 79.42% with BERT-base, BERT-large, RoBERTa-base, and RoBERTa-large, respectively, improvements of 2.05, 1.06, 1.16, and 0.52 percentage points over unsup-SimCSE.
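The retrieve-then-aggregate step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `MemoryBuffer`, `retrieve_top_k`, and `smooth_embedding`, the FIFO update rule, the group size `k`, and the `temperature` parameter are all assumptions made for illustration, and the aggregation uses a single-query attention without learned projections as a simplified stand-in for the self-attention operation.

```python
import numpy as np
from collections import deque


def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


class MemoryBuffer:
    """FIFO store of past sentence embeddings (hypothetical helper)."""

    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest entries when full.
        self.items = deque(maxlen=capacity)

    def push(self, embeddings):
        for e in embeddings:
            self.items.append(np.asarray(e, dtype=np.float64))

    def as_array(self):
        return np.stack(self.items)


def retrieve_top_k(query, buffer_array, k):
    """Return the k buffer embeddings most cosine-similar to the query."""
    sims = l2_normalize(buffer_array) @ l2_normalize(query)
    k = min(k, len(buffer_array))
    top = np.argsort(-sims)[:k]
    return buffer_array[top]


def smooth_embedding(query, neighbors, temperature=1.0):
    """Aggregate the query and its neighbors with attention weights.

    Assumption: weights are a softmax over cosine similarities to the
    query; no learned query/key/value projections are used.
    """
    group = np.vstack([query[None, :], neighbors])
    scores = l2_normalize(group) @ l2_normalize(query) / temperature
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ group


# Usage with random stand-in embeddings (dimension 8, buffer of 256).
rng = np.random.default_rng(0)
buffer = MemoryBuffer(capacity=1024)
buffer.push(rng.normal(size=(256, 8)))
query = rng.normal(size=8)
neighbors = retrieve_top_k(query, buffer.as_array(), k=4)
smoothed = smooth_embedding(query, neighbors)
print(smoothed.shape)  # (8,)
```

In this sketch the smoothed vector is a convex combination of the instance embedding and its retrieved neighbors, so it stays inside the region they span; how the smoothed embedding enters the contrastive objective is not specified in the abstract and is left out here.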