Unsupervised sentence representation learning has progressed through contrastive learning and data augmentation methods such as dropout masking. Despite this progress, sentence encoders still use only the input sentence itself when predicting its semantic vector. In this work, we show that the semantic meaning of a sentence is also determined by its nearest-neighbor sentences, that is, sentences similar to the input sentence. Based on this finding, we propose a novel unsupervised sentence encoder, RankEncoder. RankEncoder predicts the semantic vector of an input sentence by leveraging its relationship with other sentences in an external corpus, in addition to the input sentence itself. We evaluate RankEncoder on semantic textual similarity benchmark datasets. The experimental results verify that 1) RankEncoder achieves a Spearman's correlation of 80.07%, an absolute improvement of 1.1% over the previous state-of-the-art performance, 2) RankEncoder is universally applicable to existing unsupervised sentence embedding methods, and 3) RankEncoder is especially effective at predicting the similarity scores of similar sentence pairs.
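To make the idea of using an external corpus concrete, the following is a minimal sketch and not the RankEncoder implementation, whose details the abstract does not give. It assumes sentence vectors are already available from some base encoder, represents each sentence by the rank of every corpus sentence under cosine similarity, and compares two sentences by the Spearman correlation of those rank vectors. The function names `rank_vector` and `rank_similarity` are hypothetical, and ties in similarity are assumed not to occur.

```python
import numpy as np


def l2_normalize(x):
    # Scale vectors to unit length along the last axis.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def rank_vector(query_vec, corpus_vecs):
    # Cosine similarity between the query and every corpus sentence.
    sims = l2_normalize(corpus_vecs) @ l2_normalize(query_vec)
    # Rank 0 goes to the most similar corpus sentence.
    order = np.argsort(-sims)
    ranks = np.empty(len(order), dtype=float)
    ranks[order] = np.arange(len(order))
    return ranks


def rank_similarity(vec_a, vec_b, corpus_vecs):
    # Spearman correlation of the two rank vectors
    # (Pearson correlation computed on ranks; assumes no ties).
    ra = rank_vector(vec_a, corpus_vecs)
    rb = rank_vector(vec_b, corpus_vecs)
    ra = ra - ra.mean()
    rb = rb - rb.mean()
    return float(ra @ rb / (np.linalg.norm(ra) * np.linalg.norm(rb)))


# Placeholder data: random vectors stand in for encoder outputs.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(50, 8))
a = rng.normal(size=8)
b = a + 0.1 * rng.normal(size=8)  # a slightly perturbed copy of a
score = rank_similarity(a, b, corpus)
```

Under these assumptions, two sentences score highly when they order the corpus sentences similarly, which is one way a similarity score can depend on neighboring sentences rather than on the two input vectors alone.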