Determining the similarity of a sentence pair is crucial for many NLP tasks. Methods that address this problem are typically evaluated on a continuous semantic textual similarity (STS) scale from 0 to 5. However, based on a linguistic observation in the STS annotation guidelines, we find that a score in the range [4,5] indicates an upper-range sample, while all other scores indicate lower-range samples. This calls for a new approach that treats the upper-range and lower-range classes separately. In this paper, we introduce MixSP, a novel embedding space decomposition method that uses a Mixture of Specialized Projectors and is designed to distinguish and rank upper-range and lower-range samples accurately. Experimental results demonstrate that MixSP significantly reduces the representation overlap between the upper-range and lower-range classes while outperforming competitors on STS and zero-shot benchmarks.
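The upper-range/lower-range distinction described above can be illustrated with a minimal sketch. It only shows how annotated pairs might be partitioned at the score boundary; it is not the MixSP model itself. The function name `split_by_range`, the constant `UPPER_RANGE_MIN`, and the tuple layout `(sentence_a, sentence_b, score)` are assumptions made for illustration, as is treating a score of exactly 4 as upper-range (read from the closed interval [4,5]).

```python
from typing import List, Tuple

# Hypothetical layout for one annotated pair: (sentence_a, sentence_b, score)
ScoredPair = Tuple[str, str, float]

# Assumption: the interval [4, 5] is closed, so a score of exactly 4.0
# counts as an upper-range sample.
UPPER_RANGE_MIN = 4.0


def split_by_range(pairs: List[ScoredPair]) -> Tuple[List[ScoredPair], List[ScoredPair]]:
    """Partition STS-style pairs into (upper_range, lower_range) lists."""
    upper: List[ScoredPair] = []
    lower: List[ScoredPair] = []
    for sentence_a, sentence_b, score in pairs:
        # STS scores are defined on the continuous scale from 0 to 5.
        if not 0.0 <= score <= 5.0:
            raise ValueError(f"score {score} is outside the 0-5 scale")
        if score >= UPPER_RANGE_MIN:
            upper.append((sentence_a, sentence_b, score))
        else:
            lower.append((sentence_a, sentence_b, score))
    return upper, lower
```

Under these assumptions, a pair scored 4.0 or 5.0 lands in the upper-range list, while a pair scored 3.9 or 0.0 lands in the lower-range list; each list could then be handled by its own specialized component.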