Text embeddings from large language models (LLMs) have achieved excellent results in tasks such as information retrieval and semantic textual similarity. In this work, we report an interesting finding: when a text is fed into an embedding LLM, the resulting text embedding can be aligned with the key tokens of the input text. We first analyze this phenomenon comprehensively on eight embedding LLMs and show that it is universal, unaffected by model architecture, training strategy, or embedding method. Digging deeper, we find that the main change in the embedding space between an embedding LLM and its original generative LLM lies in the first principal component. By adjusting the first principal component, we can align the text embedding with the key tokens. Finally, we give several examples to demonstrate the broad application potential of this finding: (1) we propose a simple and practical sparse retrieval method based on the aligned tokens, which achieves 80\% of the dense retrieval performance of the same model while significantly reducing computation; (2) we show that our finding provides a fresh perspective for understanding fuzzy concepts (e.g., semantic relatedness vs. semantic similarity) and emerging technologies (e.g., instruction-following embedding) in this field.
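The first-principal-component adjustment described above can be sketched roughly as follows. This is a minimal illustration, assuming the text embedding is decoded over the vocabulary through the model's output token-embedding matrix; the matrix `W`, the embedding `text_emb`, and all data here are random hypothetical stand-ins, not the paper's actual models:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 64

# Hypothetical stand-ins: an LM output token-embedding matrix and one
# pooled text embedding from an embedding LLM.
W = rng.normal(size=(vocab_size, dim))
text_emb = rng.normal(size=dim)

# Estimate the first principal component from a collection of text
# embeddings (random placeholders here).
embs = rng.normal(size=(200, dim))
centered = embs - embs.mean(axis=0)
# Rows of vt are orthonormal principal directions; vt[0] is the first PC.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
v1 = vt[0]

# Adjust the embedding by removing its component along the first PC,
# then decode it over the vocabulary.
adjusted = text_emb - (text_emb @ v1) * v1
logits = W @ adjusted
top_tokens = np.argsort(-logits)[:10]  # candidate "key token" indices
```

Under this reading, the top-ranked token indices could serve directly as a sparse, bag-of-tokens representation for the retrieval method in point (1).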