Transformer-based embedding models rely on pooling to map variable-length text into a single vector, enabling efficient similarity search but also inducing well-known geometric pathologies such as anisotropy and length-induced embedding collapse. Existing accounts largely describe \emph{what} these pathologies look like, yet provide limited insight into \emph{when} and \emph{why} they harm downstream retrieval. In this work, we argue that the missing causal factor is \emph{semantic shift}: the intrinsic, structured evolution and dispersion of semantics within a text. We first present a theoretical analysis of \emph{semantic smoothing} in Transformer embeddings: as the semantic diversity among constituent sentences increases, the pooled representation necessarily shifts away from every individual sentence embedding, yielding a smoothed and less discriminative vector. Building on this foundation, we formalize semantic shift as a computable measure integrating local semantic evolution and global semantic dispersion. Through controlled experiments across corpora and multiple embedding models, we show that semantic shift aligns closely with the severity of embedding concentration and predicts retrieval degradation, whereas text length alone does not. Overall, semantic shift offers a unified and actionable lens for understanding embedding collapse and for diagnosing when anisotropy becomes harmful.
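The smoothing effect described above can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the paper's method: sentence embeddings are simulated as random unit vectors around a shared topic direction, pooling is approximated by a plain mean, and `toy_semantic_shift` is a hypothetical stand-in that adds a local term (cosine distance between consecutive sentences) to a global term (cosine distance to the pooled vector) with equal weights. The helper names, the `spread` parameter, and the weighting are all illustrative choices.

```python
import numpy as np

def unit(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def mean_pool(sentences):
    """Mean pooling over sentence vectors (assumed stand-in for a model's pooling)."""
    return sentences.mean(axis=0)

def mean_cosine_to_pooled(sentences):
    """Average cosine similarity between each sentence and the pooled vector."""
    return float((sentences @ unit(mean_pool(sentences))).mean())

def toy_semantic_shift(sentences):
    """Hypothetical score: local evolution plus global dispersion, equal weights."""
    # Local term: cosine distance between consecutive (unit-norm) sentences.
    local = 1.0 - np.sum(sentences[1:] * sentences[:-1], axis=1)
    # Global term: cosine distance from each sentence to the pooled vector.
    global_ = 1.0 - sentences @ unit(mean_pool(sentences))
    return float(local.mean() + global_.mean())

def synthetic_text(n_sentences, dim, spread, rng):
    """Simulated unit-norm sentence vectors scattered around one topic direction."""
    topic = unit(rng.normal(size=dim))
    noise = rng.normal(size=(n_sentences, dim)) / np.sqrt(dim)
    return unit(topic + spread * noise)

rng = np.random.default_rng(0)
results = {}
for spread in (0.1, 1.0, 3.0):
    text = synthetic_text(n_sentences=20, dim=64, spread=spread, rng=rng)
    results[spread] = (mean_cosine_to_pooled(text), toy_semantic_shift(text))
    print(f"spread={spread}: cos_to_pooled={results[spread][0]:.3f}, "
          f"toy_shift={results[spread][1]:.3f}")

# Length-only control: a long text with low semantic diversity.
long_uniform = synthetic_text(n_sentences=200, dim=64, spread=0.1, rng=rng)
long_uniform_cos = mean_cosine_to_pooled(long_uniform)
print(f"200 low-diversity sentences: cos_to_pooled={long_uniform_cos:.3f}")
```

For unit-norm inputs, the average cosine between the sentences and their mean equals the norm of that mean, so a more diverse set of sentences yields both a shorter pooled vector and a lower average similarity to it. In this toy setting the long but homogeneous text stays close to its pooled vector, which is consistent with the claim that length alone does not drive the effect.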