On Strengths and Limitations of Single-Vector Embeddings

Recent work (Weller et al., 2025) introduced a naturalistic dataset called LIMIT and showed empirically that a wide range of popular single-vector embedding models suffer substantial drops in retrieval quality on it, raising concerns about the reliability of single-vector embeddings for retrieval. Although Weller et al. (2025) attribute these failures mainly to limited dimensionality, we show that dimensionality alone cannot explain them. Results of Alon et al. (2016) imply that $(2k+1)$-dimensional vector embeddings suffice for top-$k$ retrieval, which points to other drivers of poor performance. Controlling for tokenization artifacts and linguistic similarity between attributes yields only modest gains. In contrast, we find that domain shift and misalignment between embedding similarities and the task's underlying notion of relevance are major contributors; finetuning mitigates these effects and can improve recall substantially. Even with finetuning, however, single-vector models remain markedly weaker than multi-vector representations, pointing to fundamental limitations. Moreover, finetuning single-vector models on LIMIT-like datasets leads to catastrophic forgetting (performance on MSMARCO drops by more than 40%), whereas forgetting for multi-vector models is minimal. To better understand the performance gap between single-vector and multi-vector models, we study the drowning-in-documents paradox (Reimers & Gurevych, 2021; Jacob et al., 2025): as the corpus grows, relevant documents are increasingly "drowned out" because embedding similarities behave, in part, like noisy statistical proxies for relevance. Through experiments and calculations on toy mathematical models, we illustrate why single-vector models are more susceptible to drowning effects than multi-vector models.
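The drowning effect described above can be illustrated with a minimal simulation. This is only a sketch under assumptions of our own choosing, not the toy models analyzed in the paper: a single relevant document receives a similarity score drawn from $N(\mathrm{signal}, 1)$, every irrelevant document receives an independent $N(0, 1)$ score, and the function name `recall_at_k` together with the parameters `signal`, `k`, and `trials` is hypothetical.

```python
import random


def recall_at_k(corpus_size, k=10, signal=3.0, trials=200, seed=0):
    """Estimate how often one relevant document stays in the top-k.

    Toy assumption (not from the paper): the relevant document's
    similarity is N(signal, 1), and each of the `corpus_size`
    irrelevant documents scores an independent N(0, 1).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        relevant_score = rng.gauss(signal, 1.0)
        # Count the irrelevant documents that outscore the relevant one.
        beaten_by = sum(
            rng.gauss(0.0, 1.0) > relevant_score for _ in range(corpus_size)
        )
        # The relevant document is retrieved if fewer than k outrank it.
        hits += beaten_by < k
    return hits / trials
```

Calling `recall_at_k(n)` for increasing `n` (for example 100, 1,000, and 10,000) should show the estimated recall falling as the corpus grows, even though the score distributions never change: more irrelevant documents simply means more chances for noise to exceed the relevant score. The sketch does not distinguish single-vector from multi-vector scoring; it only shows why a noisy proxy for relevance degrades with corpus size.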
