Scaling Laws for Embedding Dimension in Information Retrieval

Dense retrieval, which encodes each query and document as a single dense vector, has become the dominant neural retrieval approach due to its simplicity and compatibility with fast approximate nearest neighbor algorithms. As the tasks dense retrieval performs grow in complexity, the fundamental limitations of the underlying data structure and similarity metric (vectors and inner products) become more apparent. Recent work has shown theoretical limitations inherent to single vectors and inner products, limitations that are generally tied to the embedding dimension. Given the importance of embedding dimension for retrieval capacity, understanding how dense retrieval performance changes as embedding dimension is scaled is fundamental to building next-generation retrieval models that balance effectiveness and efficiency. In this work, we conduct a comprehensive analysis of the relationship between embedding dimension and retrieval performance. Our experiments cover two model families and a range of model sizes from each, giving a detailed picture of embedding scaling behavior. We find that the scaling behavior fits a power law, allowing us to derive scaling laws for performance given only embedding dimension, as well as a joint law accounting for both embedding dimension and model size. Our analysis shows that for evaluation tasks aligned with the training task, performance continues to improve as embedding dimension increases, though with diminishing returns. For evaluation data that is less aligned with the training task, performance is less predictable and, for certain tasks, degrades at larger embedding dimensions. We hope our work provides additional insight into the limitations and behavior of embeddings and offers a practical guide for selecting model size and embedding dimension to achieve optimal performance with reduced storage and compute costs.
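
To make the idea of a power-law fit over embedding dimension concrete, here is a minimal sketch. It assumes a pure power law of the form loss ≈ a · d^(−α), where d is the embedding dimension, and fits it by linear regression in log-log space. The functional form, the choice of a loss-like quantity as the target, the helper name `fit_power_law`, and all numeric values are illustrative assumptions; the abstract does not state the exact form of the law, and the data points are synthetic, not results from this work.

```python
import numpy as np


def fit_power_law(dims, losses):
    """Fit loss ~= a * dim**(-alpha) by least squares in log-log space.

    Hypothetical helper for illustration; assumes strictly positive inputs.
    Returns the pair (a, alpha).
    """
    log_d = np.log(np.asarray(dims, dtype=float))
    log_l = np.log(np.asarray(losses, dtype=float))
    # In log space the law is linear: log(loss) = log(a) - alpha * log(dim).
    slope, intercept = np.polyfit(log_d, log_l, 1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic, noise-free points generated from a known law (assumed values).
dims = np.array([32.0, 64.0, 128.0, 256.0, 512.0, 1024.0])
true_a, true_alpha = 2.0, 0.3
losses = true_a * dims ** (-true_alpha)

a_hat, alpha_hat = fit_power_law(dims, losses)

# Extrapolate the fitted law to a dimension outside the fitted range.
predicted_2048 = a_hat * 2048.0 ** (-alpha_hat)
```

Under this assumed form, doubling the embedding dimension multiplies the predicted loss by the constant factor 2^(−α), so each doubling removes a smaller absolute amount of loss than the previous one. This matches the diminishing returns described above. A law with an irreducible offset term, or a joint law that also depends on model size, would need a nonlinear fit instead of this log-log regression.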
