The unsupervised sentence embedding task aims to convert sentences into semantic vector representations. Most previous works directly use the sentence representations derived from pretrained language models. However, because of the token bias in pretrained language models, these models cannot capture the fine-grained semantics in sentences, which leads to poor predictions. To address this issue, we propose a novel Self-Adaptive Reconstruction Contrastive Sentence Embeddings (SARCSE) framework, which reconstructs all tokens in a sentence with an AutoEncoder to help the model preserve more fine-grained semantics during token aggregation. In addition, we propose a self-adaptive reconstruction loss to alleviate the token bias toward frequency. Experimental results show that SARCSE achieves significant improvements over the strong baseline SimCSE on the seven STS tasks.