As large language models (LLMs) continue to advance, retrieval-augmented generation (RAG) has become the key mechanism for expanding model knowledge and reducing hallucinations. Central to RAG is approximate nearest neighbor search (ANNS), which retrieves the database vectors most similar to a given query. However, distance calculation over high-dimensional vectors is inherently memory-bound, so retrieval performance is constrained by I/O bandwidth on mainstream platforms such as CPUs and GPUs. Although many prior early-exiting (EE) techniques attempt to reduce memory accesses by computing only a subset of the dimensions, the partial distance converges too slowly to the EE threshold, which ultimately limits their performance gains. To address these challenges, we propose NASZIP, a hardware-software co-designed framework that integrates near data processing (NDP) with a novel feature-level early-exiting scheme guided by statistics-based principal component analysis (PCA). Instead of relying solely on partial distances, NASZIP incorporates estimation and correction parameters to approximate full-dimensional distances accurately, enabling earlier exiting without compromising accuracy. We further introduce a bit-level, NDP-aware dynamic-float scheme that significantly reduces memory access for vector data. On the hardware side, we develop a data-aware neighbor list mapping strategy that reduces neighbor retrieval latency and inter-channel communication overhead, complemented by a dedicated cache that exploits data locality and enhances prefetch efficiency. With these co-optimized techniques, NASZIP delivers speedups of up to $8.4\times$ and $1.4\times$ over a CPU baseline and a state-of-the-art GPU implementation, respectively, at equal accuracy. Relative to the state-of-the-art NDP ANNS accelerator ANSMET, NASZIP achieves a $1.69\times$ performance improvement.
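To make the partial-distance early exiting used by prior EE techniques concrete, the following is a minimal sketch of the generic idea only, not of NASZIP's estimation-and-correction scheme. The function name `partial_distance_early_exit`, the `block` parameter, and the choice of squared L2 distance are illustrative assumptions and do not come from the paper.

```python
from typing import Optional, Sequence, Tuple


def partial_distance_early_exit(
    query: Sequence[float],
    candidate: Sequence[float],
    threshold: float,
    block: int = 4,  # hypothetical granularity: dimensions fetched per step
) -> Tuple[Optional[float], int]:
    """Accumulate the squared L2 distance block by block.

    Returns (distance, dims_read). The distance is None when the running
    partial sum already exceeds the threshold, so the remaining dimensions
    never have to be fetched from memory.
    """
    acc = 0.0
    dims_read = 0
    n = len(query)
    for start in range(0, n, block):
        end = min(start + block, n)
        for q, c in zip(query[start:end], candidate[start:end]):
            diff = q - c
            acc += diff * diff
        dims_read = end
        # Squared terms are non-negative, so the partial sum is a lower
        # bound on the full distance and exiting here is always safe.
        if acc > threshold:
            return None, dims_read
    return acc, dims_read


near = partial_distance_early_exit([0.0] * 8, [0.5] * 8, threshold=3.0)
far = partial_distance_early_exit([0.0] * 8, [2.0] * 8, threshold=3.0)
```

In this sketch the far candidate is rejected after reading half of its dimensions, while the near candidate must be read in full. Because the partial sum only grows as dimensions are added, a candidate whose full distance is just above the threshold is rejected late, which illustrates the slow convergence described above.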