In this paper, we adopt a retrospective approach to examine and compare five popular sentence encoders, i.e., Sentence-BERT (SBERT), Universal Sentence Encoder (USE), LASER, InferSent, and Doc2vec, in terms of their performance on downstream tasks versus their capability to capture basic semantic properties. We first evaluated all five sentence encoders on the popular SentEval benchmark and found that several of them perform quite well on a variety of downstream tasks. However, because no single encoder won in all cases, we designed further experiments to gain a deeper understanding of their behavior. Specifically, we proposed four semantic evaluation criteria, i.e., Paraphrasing, Synonym Replacement, Antonym Replacement, and Sentence Jumbling, and evaluated the same five sentence encoders against them. We found that the SBERT and USE models pass the Paraphrasing criterion, with SBERT being the better of the two. LASER dominates on the Synonym Replacement criterion. Interestingly, all the sentence encoders failed the Antonym Replacement and Sentence Jumbling criteria. These results suggest that although these popular sentence encoders perform quite well on the SentEval benchmark, they still struggle to capture some basic semantic properties, which poses a daunting dilemma for NLP research.
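The abstract does not specify how each criterion is scored, so the following is only a minimal sketch of the kind of check that Sentence Jumbling and Antonym Replacement suggest: perturb a sentence and compare the embeddings of the original and the perturbed version by cosine similarity. The encoder here, `toy_encode`, is a hypothetical bag-of-words stand-in rather than any of the five encoders studied, and the example sentences, the `jumble` helper, and the use of cosine similarity are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def toy_encode(sentence):
    """Hypothetical stand-in encoder: a bag-of-words count vector (ignores word order)."""
    return Counter(sentence.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jumble(sentence, seed=0):
    """Shuffle the words of a sentence (assumed form of the jumbling perturbation)."""
    words = sentence.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


original = "the dog chased the cat across the yard"
antonym = "the dog avoided the cat across the yard"  # hand-made, illustrative only
jumbled = jumble(original)

# An encoder that captured meaning would be expected to score both
# perturbed sentences as clearly less similar to the original.
sim_jumbled = cosine(toy_encode(original), toy_encode(jumbled))
sim_antonym = cosine(toy_encode(original), toy_encode(antonym))
print(f"jumbled: {sim_jumbled:.3f}  antonym: {sim_antonym:.3f}")
```

Because the toy encoder discards word order, the jumbled sentence is scored as essentially identical to the original, and swapping a single word for an opposite barely lowers the score. This is the sort of behavior the two failed criteria are meant to expose. To probe a real encoder, `toy_encode` would be replaced by that model's embedding function and `cosine` by a dense-vector equivalent.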