LeanVec: Searching vectors faster by making them fit

Modern deep learning models generate high-dimensional vectors whose similarity reflects semantic resemblance. Similarity search, i.e., retrieving the vectors in a large collection that are most similar to a given query, has therefore become a critical component of a wide range of applications that demand highly accurate and timely answers. In this setting, the high vector dimensionality puts similarity search systems under compute and memory pressure, leading to subpar performance. Additionally, cross-modal retrieval tasks have become increasingly common, e.g., a user enters a text query to find the most relevant images. The queries in such tasks often follow a different distribution than the database embeddings, which makes high accuracy hard to achieve. In this work, we present LeanVec, a framework that combines linear dimensionality reduction with vector quantization to accelerate similarity search on high-dimensional vectors while maintaining accuracy. We present LeanVec variants for in-distribution (ID) and out-of-distribution (OOD) queries. LeanVec-ID yields accuracies on par with those of recently introduced deep learning alternatives whose computational overhead precludes their use in practice. LeanVec-OOD uses two novel dimensionality reduction techniques that consider both the query and database distributions to further boost the accuracy and performance of the framework, and it remains competitive even when the query and database distributions match. All in all, our extensive and varied experimental results show that LeanVec produces state-of-the-art results, with up to 3.7x improvement in search throughput and up to 4.9x faster index build time over the state of the art.
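
To make the combination of linear dimensionality reduction and vector quantization concrete, the following is a minimal sketch and not the LeanVec implementation. Everything in it is an assumption made for illustration: the projection is an SVD-based (PCA-style) stand-in for the linear reduction, the quantizer is a plain per-vector 8-bit uniform scalar quantizer, the search is a brute-force scan rather than an index, the final re-ranking with full-dimensional vectors is an added step of this sketch, and all function names (`fit_projection`, `quantize`, `dequantize`, `search`) are hypothetical.

```python
import numpy as np


def fit_projection(X, d):
    """Return a (d, D) linear projection from the top-d right singular vectors.

    Assumption: an uncentered SVD is used so that inner products in the
    reduced space approximate inner products in the original space.
    """
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:d]


def quantize(X):
    """Per-vector uniform scalar quantization to 8-bit codes (illustrative)."""
    lo = X.min(axis=1, keepdims=True)
    hi = X.max(axis=1, keepdims=True)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0  # avoid division by zero for constant vectors
    codes = np.round((X - lo) / scale).astype(np.uint8)
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Reconstruct approximate float vectors from the 8-bit codes."""
    return codes.astype(np.float32) * scale + lo


def search(query, P, codes, lo, scale, X_full, k=10, n_candidates=50):
    """Approximate maximum inner product search.

    1. Project the query to d dimensions.
    2. Score all reduced, quantized database vectors (brute force here).
    3. Re-rank the best candidates with the full-dimensional vectors
       (an assumption of this sketch).
    """
    q_low = P @ query
    approx = dequantize(codes, lo, scale) @ q_low
    cand = np.argpartition(-approx, n_candidates - 1)[:n_candidates]
    exact = X_full[cand] @ query
    return cand[np.argsort(-exact)[:k]]
```

A typical use would compute `P = fit_projection(X, d)` once, store `quantize(X @ P.T)` in place of the full vectors for the first pass, and call `search` for each query. The speed-up in such a scheme comes from scanning `d`-dimensional 8-bit codes instead of `D`-dimensional floats; how much accuracy survives depends on how well the projection preserves the inner products that matter for the queries.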
