Large language models (LLMs) exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlying this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to directly refine text embeddings derived from LLMs. Specifically, we find that the unembedding matrix within LLMs encodes a latent subspace that actively writes these frequent tokens into the embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs to improve text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
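To make the idea of filtering a subspace of the unembedding matrix concrete, the following is a minimal NumPy sketch, not the released implementation. It rests on one loud assumption: that the subspace to remove is spanned by the top-`k` right singular vectors of the unembedding matrix. The abstract does not state how the subspace is identified, so this choice, the function names `build_filter` and `apply_filter`, and the parameter `k` are all hypothetical. The sketch also assumes the vocabulary size is at least the hidden dimension `d`.

```python
import numpy as np


def build_filter(unembedding: np.ndarray, k: int) -> np.ndarray:
    """Build a (d, d - k) linear map that drops a k-dimensional subspace.

    ASSUMPTION (not from the paper): the subspace to filter is spanned by the
    top-k right singular vectors of the unembedding matrix (vocab_size, d).
    """
    vocab_size, d = unembedding.shape
    if vocab_size < d:
        raise ValueError("this sketch assumes vocab_size >= hidden dimension")
    if not 0 <= k < d:
        raise ValueError("k must satisfy 0 <= k < d")
    # Rows of vt are orthonormal directions in hidden space, ordered by
    # decreasing singular value.
    _, _, vt = np.linalg.svd(unembedding, full_matrices=False)
    # Keep only the directions outside the assumed subspace. Expressing the
    # result in this basis is what yields the reduced dimension d - k.
    return vt[k:].T


def apply_filter(embeddings: np.ndarray, filt: np.ndarray) -> np.ndarray:
    """Project (n, d) embeddings to (n, d - k) and L2-normalise each row."""
    reduced = embeddings @ filt
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    return reduced / np.clip(norms, 1e-12, None)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_unembedding = rng.normal(size=(50, 8))  # toy stand-in for an LLM head
    toy_embeddings = rng.normal(size=(3, 8))    # toy stand-in for text embeddings
    filt = build_filter(toy_unembedding, k=2)
    print(apply_filter(toy_embeddings, filt).shape)
```

Because the retained directions are orthonormal, inner products between filtered embeddings equal those of the original embeddings after the assumed subspace has been projected out, which is why the filtering and the dimensionality reduction can be done in a single matrix multiplication.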