LeanVec: Search your vectors faster by making them fit

Modern deep learning models can generate high-dimensional vectors whose similarity reflects semantic resemblance. Similarity search, i.e., retrieving the vectors in a large collection that are similar to a given query, has therefore become a critical component of a wide range of applications that demand highly accurate and timely answers. In this setting, high vector dimensionality puts similarity search systems under compute and memory pressure, leading to subpar performance. Additionally, cross-modal retrieval tasks have become increasingly common, e.g., a user enters a text query to find the most relevant images. The queries in such tasks often follow a different distribution than the database embeddings, which makes high accuracy difficult to achieve. In this work, we present LeanVec, a framework that combines linear dimensionality reduction with vector quantization to accelerate similarity search on high-dimensional vectors while maintaining accuracy. We present LeanVec variants for in-distribution (ID) and out-of-distribution (OOD) queries. LeanVec-ID yields accuracies on par with those of recently introduced deep learning alternatives whose computational overhead precludes their use in practice. LeanVec-OOD uses a novel dimensionality reduction technique that considers both the query and database distributions to further boost the accuracy and the performance of the framework, and it remains competitive when the query and database distributions match. Our extensive and varied experimental results show that LeanVec outperforms the state of the art, with up to 3.7x higher search throughput and up to 4.9x faster index build time.
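To make the combination of linear dimensionality reduction and vector quantization concrete, the following is a minimal sketch and not the LeanVec implementation. It assumes inner-product similarity, uses an uncentered PCA projection computed with an SVD as a stand-in for the learned projection, applies a simple per-vector 8-bit scalar quantizer, and replaces the index with a brute-force scan followed by re-ranking on the full-dimensional vectors. All function names and parameters (`fit_projection`, `quantize_uint8`, `two_stage_search`, `n_candidates`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def fit_projection(database, target_dim):
    """Return a (target_dim, D) matrix whose rows are the top right-singular
    vectors of the database (uncentered PCA; an assumption of this sketch)."""
    _, _, vt = np.linalg.svd(database, full_matrices=False)
    return vt[:target_dim]


def quantize_uint8(vectors):
    """Per-vector scalar quantization: map each row to 8-bit codes plus the
    (offset, scale) pair needed to decode it."""
    lo = vectors.min(axis=1, keepdims=True)
    hi = vectors.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-12) / 255.0
    codes = np.round((vectors - lo) / scale).astype(np.uint8)
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Decode 8-bit codes back to approximate floating-point vectors."""
    return codes.astype(np.float32) * scale + lo


def two_stage_search(query, database, projection, codes, lo, scale,
                     k=10, n_candidates=50):
    """Score all vectors in the reduced, quantized space, then re-rank the
    best candidates with exact inner products on the original vectors."""
    n_candidates = min(n_candidates, database.shape[0])
    reduced_query = projection @ query
    approx_scores = dequantize(codes, lo, scale) @ reduced_query
    # Unordered indices of the n_candidates largest approximate scores.
    candidates = np.argpartition(-approx_scores, n_candidates - 1)[:n_candidates]
    exact_scores = database[candidates] @ query
    return candidates[np.argsort(-exact_scores)[:k]]


def build(database, target_dim):
    """Fit the projection and encode the reduced database vectors."""
    projection = fit_projection(database, target_dim)
    codes, lo, scale = quantize_uint8(database @ projection.T)
    return projection, codes, lo, scale
```

In this sketch the first stage touches only `target_dim` bytes per database vector instead of `D` floats, which is where the reduction in memory traffic and compute would come from; the re-ranking step recovers accuracy lost to projection and quantization. The projection here is fitted on the database alone, so it only illustrates the in-distribution setting; a projection that also accounts for the query distribution, as the abstract describes for LeanVec-OOD, is not shown.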
