In this work, we address the problem of cardinality estimation for similarity search in high-dimensional spaces. Our goal is to design a framework that is lightweight, easy to construct, and capable of providing accurate estimates with satisfactory online efficiency. We leverage locality-sensitive hashing (LSH) to partition the vector space while preserving distance proximity. Building on this, we adopt the principles of classical multi-probe LSH to adaptively explore neighboring buckets, accounting for distance thresholds of varying magnitudes. To improve online efficiency, we employ progressive sampling to reduce the number of distance computations and use asymmetric distance computation from product quantization to accelerate distance calculations in high-dimensional spaces. In addition to handling static datasets, our framework includes an update algorithm designed to efficiently support large-scale dynamic scenarios in which the data is updated. Experiments demonstrate that our methods accurately estimate the cardinality of similarity queries with satisfactory efficiency.
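To make the general idea concrete, the following is a minimal sketch of LSH-based cardinality estimation with bucket probing and sampling. It is not the framework described above: the function names, the E2LSH-style hash with bucket width `w`, the one-step probing rule, and the fixed `sample_size` are all assumptions chosen for illustration. In particular, a fixed-size sample stands in for progressive sampling, and exact Euclidean distances stand in for asymmetric distance computation.

```python
import math
import random
from collections import defaultdict


def make_hash(dim, k, w, rng):
    """Draw k random projections (a, b) for an E2LSH-style hash (assumed scheme)."""
    return [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
            for _ in range(k)]


def bucket_key(x, funcs, w):
    """Concatenate k quantized projections into one bucket key."""
    return tuple(math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / w)
                 for a, b in funcs)


def build_table(data, funcs, w):
    """Map each bucket key to the indices of the points that fall into it."""
    table = defaultdict(list)
    for i, x in enumerate(data):
        table[bucket_key(x, funcs, w)].append(i)
    return table


def probe_keys(key):
    """Yield the query's bucket, then buckets differing by +/-1 in one coordinate."""
    yield key
    for j in range(len(key)):
        for delta in (-1, 1):
            yield key[:j] + (key[j] + delta,) + key[j + 1:]


def estimate_cardinality(q, r, data, table, funcs, w, sample_size, rng):
    """Estimate |{x : dist(x, q) <= r}| from samples of the probed buckets."""
    estimate = 0.0
    for key in probe_keys(bucket_key(q, funcs, w)):
        ids = table.get(key, [])
        if not ids:
            continue
        # Sample at most sample_size points to limit distance computations.
        sample = ids if len(ids) <= sample_size else rng.sample(ids, sample_size)
        hits = sum(1 for i in sample if math.dist(data[i], q) <= r)
        # Scale the hit fraction in the sample up to the full bucket size.
        estimate += len(ids) * hits / len(sample)
    return estimate
```

Because only the query's bucket and its immediate neighbors are inspected, this sketch ignores matches in buckets that are never probed and therefore tends to underestimate for large thresholds. An adaptive probing strategy, as the abstract describes, would widen the set of probed buckets as the threshold grows.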