As the dimensionality of modern learned representations grows into the thousands, state-of-the-art Approximate Nearest Neighbor (ANN) indices exhibit severe limitations. Graph-based methods (e.g., HNSW) suffer from prohibitive memory consumption and routing degradation, while recent randomized quantization and learned rotation approaches (e.g., RaBitQ, OPQ) impose significant preprocessing overheads. We introduce CRISP, a novel framework designed for ANN search in very-high-dimensional spaces. Unlike rigid pipelines that apply expensive orthogonal rotations indiscriminately, CRISP employs a lightweight, correlation-aware adaptive strategy that redistributes variance only when necessary, thereby reducing the preprocessing complexity. We couple this adaptive mechanism with a cache-coherent Compressed Sparse Row (CSR) index structure. Furthermore, CRISP incorporates a multi-stage, dual-mode query engine: a Guaranteed Mode that preserves rigorous theoretical lower bounds on recall, and an Optimized Mode that leverages rank-based weighted scoring and early termination to reduce query latency. Extensive evaluation on very-high-dimensional datasets (up to 4096 dimensions) demonstrates that CRISP achieves state-of-the-art query throughput, low construction costs, and peak memory efficiency.
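To illustrate the general idea behind a Compressed Sparse Row layout, the following minimal sketch packs per-bucket lists of vector IDs into two flat arrays, so that the members of one bucket occupy a single contiguous slice. It is a generic CSR construction and not CRISP's actual index; the names `build_csr` and `bucket_members`, and the bucket assignments used in the example, are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def build_csr(assignments: Sequence[int], num_buckets: int) -> Tuple[List[int], List[int]]:
    """Pack per-bucket ID lists into two flat arrays via a counting sort.

    offsets has num_buckets + 1 entries; the IDs stored in bucket b are
    ids[offsets[b]:offsets[b + 1]].
    """
    # Pass 1: count the members of each bucket, shifted by one slot.
    offsets = [0] * (num_buckets + 1)
    for b in assignments:
        offsets[b + 1] += 1
    # A prefix sum turns the counts into start positions.
    for b in range(num_buckets):
        offsets[b + 1] += offsets[b]
    # Pass 2: scatter each vector ID into its bucket's slice.
    cursor = offsets[:-1].copy()
    ids = [0] * len(assignments)
    for vec_id, b in enumerate(assignments):
        ids[cursor[b]] = vec_id
        cursor[b] += 1
    return offsets, ids


def bucket_members(offsets: List[int], ids: List[int], b: int) -> List[int]:
    """Return the vector IDs in bucket b as one contiguous slice."""
    return ids[offsets[b]:offsets[b + 1]]


# Six vectors assigned to three hypothetical buckets.
offsets, ids = build_csr([2, 0, 2, 1, 0, 2], num_buckets=3)
```

Because each bucket is a contiguous range of `ids`, scanning a bucket touches sequential memory rather than following pointers, which is the usual motivation for CSR-style layouts in cache-sensitive code.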