Semantic similarity between natural language texts is typically measured either by the overlap between subsequences (e.g., BLEU) or by comparing embeddings (e.g., BERTScore, S-BERT). In this paper, we argue that when we are only interested in measuring semantic similarity, it is better to predict the similarity directly with a model fine-tuned for that task. Using a model fine-tuned for the STS-B task from the GLUE benchmark, we define the STSScore approach and show that the resulting similarity is better aligned with our expectations of a robust semantic similarity measure than other approaches.
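To make the idea of predicting similarity directly more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a regressor fine-tuned on STS-B that returns a raw score on the STS-B label scale of 0 to 5, and it assumes the score is rescaled to [0, 1] by dividing by 5; the abstract states neither detail. The names `sts_score`, `predict`, and `STSB_MAX` are hypothetical, and the model is passed in as a plain callable so that the sketch does not depend on any particular library.

```python
from typing import Callable

# Assumption: the fine-tuned regressor predicts on the STS-B label scale,
# where 0 means unrelated and 5 means semantically equivalent.
STSB_MAX = 5.0


def sts_score(
    predict: Callable[[str, str], float],
    reference: str,
    candidate: str,
) -> float:
    """Turn a raw STS-B-style prediction into a similarity in [0, 1].

    `predict` is a hypothetical stand-in for a fine-tuned model: any
    callable that takes two sentences and returns a raw score.
    """
    raw = predict(reference, candidate)
    # A regression head can overshoot the label range, so clamp first.
    clamped = min(max(raw, 0.0), STSB_MAX)
    # Assumed normalization: divide by the maximum STS-B label.
    return clamped / STSB_MAX


if __name__ == "__main__":
    # Fixed-value stub in place of a real model, for illustration only.
    def stub(a: str, b: str) -> float:
        return 4.0

    print(sts_score(stub, "A man is playing a guitar.", "A person plays guitar."))
```

Passing the model in as a callable keeps the scoring logic separate from the choice of checkpoint, so any STS-B regressor can be wrapped to match the `predict` signature.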