Transformer-based embedding models frequently exhibit geometric pathologies, such as anisotropy and length-induced representation collapse, which can degrade downstream retrieval performance. While prior work often attributes these issues directly to text length or attention mechanisms, we argue that the fundamental drivers are instead the inherent pooling operations coupled with internal semantic shift. In this paper, we establish a unified theoretical framework proving that contextual pooling intrinsically causes embedding collapse. Specifically, we mathematically prove that pooling semantically diverse sentences inevitably leads to micro-level semantic dilution, and strictly reduces the Mean Pairwise Distance of the vector space, guaranteeing macro-level spatial concentration. Grounded in these geometric insights, we formally define semantic shift to capture the natural semantic evolution and dispersion within a text. Through carefully controlled experiments across diverse models and corpora, we disentangle text length from semantic content. We demonstrate that semantic shift is the primary predictor of severe embedding concentration. Crucially, our retrieval evaluations reveal that anisotropy is fundamentally harmful only when induced by strong semantic shifts, reconciling conflicting observations in prior literature and offering a principled explanation for the long-context challenges faced by modern embedding models.
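The claim that pooling reduces the Mean Pairwise Distance can be illustrated with a small synthetic sketch. It is not the paper's experimental setup: the "sentence embeddings" here are assumed to be i.i.d. Gaussian vectors, pooling is assumed to be an unnormalized mean, distance is assumed to be Euclidean, and the names `mean_pairwise_distance` and `pooled_mpd` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def mean_pairwise_distance(x: np.ndarray) -> float:
    """Average Euclidean distance over all unordered pairs of rows in x."""
    diffs = x[:, None, :] - x[None, :, :]      # shape (n, n, dim)
    dists = np.linalg.norm(diffs, axis=-1)     # shape (n, n)
    upper = np.triu_indices(len(x), k=1)       # each unordered pair once
    return float(dists[upper].mean())


def pooled_mpd(k: int, n_docs: int = 200, dim: int = 64, seed: int = 0) -> float:
    """MPD of n_docs text embeddings, each the mean of k synthetic sentence vectors."""
    rng = np.random.default_rng(seed)
    # Assumption: sentence embeddings are i.i.d. standard Gaussian vectors,
    # standing in for semantically diverse sentences.
    sentences = rng.normal(size=(n_docs, k, dim))
    docs = sentences.mean(axis=1)              # mean pooling over the k sentences
    return mean_pairwise_distance(docs)


mpd_by_k = {k: pooled_mpd(k) for k in (1, 4, 16)}
for k, mpd in mpd_by_k.items():
    print(f"k={k:2d} sentences per text -> MPD {mpd:.3f}")
```

Under these assumptions, the pooled embeddings move closer together as more diverse sentences are averaged into each text, which mirrors the macro-level concentration described above. Real embedding models have correlated sentence vectors and often re-normalize after pooling, so the exact rate of contraction would differ.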