A measure of similarity between text embeddings can be considered adequate only if it aligns with human perception of similarity between texts. In this paper, we introduce the distance-to-distance ratio (DDR), a novel measure of similarity between LLM sentence embeddings. Inspired by Lipschitz continuity, DDR measures the rate of change from the similarity between pre-context word embeddings to the similarity between post-context LLM embeddings, thereby quantifying the semantic influence of context. We evaluate DDR in experiments designed as a series of controlled perturbations of sentences drawn from a sentence dataset. For each sentence, we generate variants by replacing one, two, or three words with either synonyms, yielding semantically similar text, or randomly chosen words, yielding semantically dissimilar text. We compare DDR with other prevailing similarity metrics and demonstrate that DDR consistently provides finer discrimination between semantically similar and dissimilar texts, even under minimal, controlled edits.
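To make the intuition concrete, the sketch below shows one plausible way such a distance-to-distance ratio could be computed, assuming Euclidean distance, mean-pooled static word embeddings as the pre-context representation, and an LLM sentence embedding as the post-context representation; the exact distance function, pooling, and normalization used by DDR are defined in the body of the paper, not here.

```python
import numpy as np

def ddr(static_a, static_b, contextual_a, contextual_b, eps=1e-12):
    """Hypothetical distance-to-distance ratio (DDR) sketch.

    static_*     : pre-context representations (e.g. mean of static word
                   embeddings), shape (d,)
    contextual_* : post-context LLM sentence embeddings, shape (d,)

    Inspired by the Lipschitz quotient |f(x) - f(y)| / |x - y|: the ratio of
    the distance after contextualization to the distance before it.
    """
    pre_dist = np.linalg.norm(static_a - static_b)
    post_dist = np.linalg.norm(contextual_a - contextual_b)
    return post_dist / (pre_dist + eps)

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
x, y = rng.normal(size=768), rng.normal(size=768)
fx, fy = rng.normal(size=768), rng.normal(size=768)
print(ddr(x, y, fx, fy))
```

Under this reading, a ratio near 1 would indicate that context leaves the relative distance between the two texts unchanged, while larger or smaller ratios would indicate that context pulls the texts apart or draws them together semantically.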