Semantic similarity between natural language texts is typically measured either by the overlap between subsequences (e.g., BLEU) or by comparing embeddings (e.g., BERTScore, S-BERT). In this paper, we argue that when we are only interested in semantic similarity, it is better to predict the similarity directly with a model fine-tuned for that task. Using a model fine-tuned for the Semantic Textual Similarity Benchmark (STS-B) task from the GLUE benchmark, we define the STSScore approach and show that the resulting similarity is better aligned with our expectations of a robust semantic similarity measure than other approaches.
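As a rough illustration of the idea of predicting similarity directly, the sketch below wraps a regression model fine-tuned on STS-B and rescales its output to [0, 1]. Several details are assumptions rather than statements from the abstract: the rescaling by 5 (the upper end of the STS-B label range), the helper names `normalize_stsb`, `sts_score`, and `load_predictor`, and the expectation that the checkpoint passed as `model_name` is a Hugging Face sequence-classification model with a single regression output. No specific checkpoint is implied.

```python
from typing import Callable

# STS-B gold labels range from 0 (unrelated) to 5 (semantically equivalent).
STSB_MAX = 5.0


def normalize_stsb(raw: float) -> float:
    """Clamp a raw STS-B style prediction to [0, 5] and rescale it to [0, 1].

    Assumption: dividing by 5 is one plausible normalization, not
    necessarily the one used in the paper.
    """
    return min(max(raw, 0.0), STSB_MAX) / STSB_MAX


def sts_score(text_a: str, text_b: str,
              predict: Callable[[str, str], float]) -> float:
    """Score a text pair with any predictor that returns a value on the STS-B scale."""
    return normalize_stsb(predict(text_a, text_b))


def load_predictor(model_name: str) -> Callable[[str, str], float]:
    """Hypothetical helper: build a predictor from a fine-tuned regression checkpoint.

    Assumes `model_name` points to a model with exactly one output logit.
    Imports are local so the scoring helpers work without these packages.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    def predict(text_a: str, text_b: str) -> float:
        # Encode both texts as one sentence-pair input.
        inputs = tokenizer(text_a, text_b, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        # A single regression logit is read as the predicted similarity.
        return logits.squeeze().item()

    return predict
```

With a suitable checkpoint, `sts_score(reference, candidate, load_predictor(model_name))` would return one similarity value per text pair. Passing the predictor in as a callable keeps the normalization separate from the choice of model.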