Text embeddings from large language models (LLMs) have achieved excellent results in tasks such as information retrieval and semantic textual similarity. In this work, we report an interesting finding: when a text is fed into an embedding LLM, the resulting text embedding can be aligned with the key tokens of the input text. We first analyze this phenomenon comprehensively across eight embedding LLMs and show that it is universal, unaffected by model architecture, training strategy, or embedding method. Through deeper analysis, we find that the main change in embedding space between an embedding LLM and its original generative LLM lies in the first principal component; by adjusting this component, the text embedding can be aligned with the key tokens. Finally, we give several examples demonstrating the broad application potential of this finding: (1) we propose a simple and practical sparse retrieval method based on the aligned tokens, which achieves 80\% of the dense retrieval performance of the same model while significantly reducing computation; (2) we show that our finding offers a fresh perspective for understanding fuzzy concepts (e.g., semantic relatedness vs. semantic similarity) and emerging techniques (e.g., instruction-following embedding) in this field.
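The core mechanism described above can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the vocabulary matrix `W`, the embeddings `E`, and the shared offset simulating a dominant first principal component are all synthetic stand-ins, used only to show the two steps of projecting out the first principal component and then "decoding" the adjusted embedding against the vocabulary to surface aligned tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not the paper's actual models):
# W: vocabulary embedding matrix (vocab_size x dim), like an LM decoding head.
# E: a batch of text embeddings (n x dim) from a hypothetical embedding LLM,
#    given a shared offset to mimic a dominant first principal component.
vocab_size, dim, n = 1000, 64, 200
W = rng.normal(size=(vocab_size, dim))
E = rng.normal(size=(n, dim)) + 5.0

# Step 1: adjust the first principal component — center the embeddings and
# project out the top principal direction of the batch.
mu = E.mean(axis=0)
Ec = E - mu
u = np.linalg.svd(Ec, full_matrices=False)[2][0]  # first right singular vector
E_adj = Ec - np.outer(Ec @ u, u)                  # remove first principal component

# Step 2: score each adjusted embedding against the vocabulary; the highest-
# scoring token ids play the role of the "aligned key tokens".
logits = E_adj @ W.T                              # (n x vocab_size)
top_tokens = np.argsort(-logits, axis=1)[:, :10]  # top-10 token ids per text
```

In a sparse-retrieval setting along the lines of application (1), only these top token ids and their scores would be kept per text, so matching reduces to sparse term overlap instead of dense vector comparison.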