Ascend-RaBitQ: Heterogeneous NPU-CPU Acceleration of Billion-Scale Similarity Search with 1-bit Quantization

Fujun He, Chuyue Ye, Huaxiang Cai, Zetao Lv, Baolong Cui, Wenru Yan, Chao Zhan, Zigang Zhang, Hao Yi, Jie Xiang, Xiabing Li, Yuhang Gai, Ziyang Zhang, Pengfei Zheng, Yunfei Du

Vector similarity search is a critical component of modern AI systems, but traditional CPU-based implementations face fundamental scalability bottlenecks on billion-scale corpora because of prohibitive computational overhead and limited memory bandwidth. Neural Processing Units (NPUs) offer orders-of-magnitude higher compute density, yet existing CPU/GPU-optimized implementations of 1-bit RaBitQ quantization cannot be ported directly to NPU architectures because of fundamental hardware mismatches, and homogeneous designs struggle to balance accuracy, memory footprint, and performance at the same time. This paper presents Ascend-RaBitQ, the first heterogeneous NPU-CPU optimized IVF-RaBitQ system for billion-scale vector search. It is built on the core insight that decoupling coarse ranking (on the NPU) from fine ranking (on the CPU) lets each stage run on the hardware best suited to it, breaking the long-standing accuracy-memory-performance trade-off. We propose a three-stage heterogeneous execution path: AI Core-accelerated coarse ranking on 1-bit quantized vectors, on-device AI CPU Top-k processing, and host CPU fine re-ranking on full-precision vectors. We introduce four NPU architecture-native optimizations: fused AIC-AIV operators for parallel distance computation, computation flow restructuring that exploits rotation orthogonality, fine-grained load balancing at index-block level that breaks query boundaries, and intra-NPU pipeline parallelism between AI Core and AI CPU to mask Top-k latency. Evaluation on standard datasets shows that Ascend-RaBitQ achieves 3.0x to 62.8x faster index construction than the CPU baseline, up to 11.7x higher throughput than the fastest CPU IVF-RaBitQ implementation, and more than two orders of magnitude higher throughput than the mathematically equivalent CPU baseline, while demonstrating encouraging scalability on distributed multi-NPU systems.
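The coarse-then-fine execution path described in the abstract can be illustrated with a minimal, CPU-only sketch. It is not the authors' implementation. It assumes plain sign-bit codes scored by Hamming distance as a stand-in for the RaBitQ distance estimator, omits the IVF partitioning and the random rotation, and runs every stage on the host rather than on AI Core or AI CPU. All function names (`sign_code`, `coarse_rank`, `fine_rerank`, `search`) are hypothetical.

```python
import heapq
import random


def sign_code(vec):
    """Pack the sign bits of a vector into one Python int (1 bit per dimension)."""
    code = 0
    for i, x in enumerate(vec):
        if x >= 0.0:
            code |= 1 << i
    return code


def squared_l2(a, b):
    """Exact squared Euclidean distance on full-precision vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def coarse_rank(query_code, codes, n_candidates):
    """Stages 1 and 2 (assumed form): score every 1-bit code, keep the best few.

    Hamming distance between sign codes is a simplification used here for
    illustration; it is not the RaBitQ estimator.
    """
    scored = ((bin(query_code ^ c).count("1"), idx) for idx, c in enumerate(codes))
    return [idx for _, idx in heapq.nsmallest(n_candidates, scored)]


def fine_rerank(query, vectors, candidates, k):
    """Stage 3: exact distances, computed only for the shortlisted candidates."""
    return heapq.nsmallest(k, candidates, key=lambda idx: squared_l2(query, vectors[idx]))


def search(query, vectors, codes, k, n_candidates):
    shortlist = coarse_rank(sign_code(query), codes, n_candidates)
    return fine_rerank(query, vectors, shortlist, k)


# Toy data: 500 random 64-dimensional vectors and a query near vector 42.
rng = random.Random(0)
dim = 64
vectors = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(500)]
codes = [sign_code(v) for v in vectors]
query = [x + rng.gauss(0.0, 0.05) for x in vectors[42]]
result = search(query, vectors, codes, k=5, n_candidates=50)
```

In this sketch the coarse stage touches only one bit per dimension for every database vector, while the full-precision vectors are read only for the `n_candidates` shortlist. That split is what allows the two stages to live on different hardware and in different memory tiers. Raising `n_candidates` trades throughput for recall.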
