Retrieval-Augmented Generation (RAG) relies on large-scale Approximate Nearest Neighbor Search (ANNS) to retrieve semantically relevant context for large language models. Among ANNS methods, IVF-PQ offers an attractive balance between memory efficiency and search accuracy. However, achieving high recall requires a reranking stage that fetches full-precision vectors, and billion-scale vector databases must reside in CPU DRAM or SSD due to the limited capacity of GPU HBM. This off-GPU data movement introduces substantial latency and throughput degradation. We propose HAVEN, a GPU architecture augmented with High-Bandwidth Flash (HBF), a recently introduced die-stacked 3D NAND technology engineered to deliver terabyte-scale capacity and hundreds of GB/s of read bandwidth. By integrating HBF and a near-storage search unit as an on-package complement to HBM, HAVEN enables the full-precision vector database to reside entirely on-device, eliminating PCIe and DDR bottlenecks during reranking. Through detailed modeling of re-architected 3D NAND subarrays, power-constrained HBF bandwidth, and end-to-end IVF-PQ pipelines, we demonstrate that HAVEN improves reranking throughput by up to 20x and reduces latency by up to 40x across billion-scale datasets compared to GPU-DRAM and GPU-SSD systems. Our results show that HBF-augmented GPUs enable high-recall retrieval at throughput previously achievable only without reranking, offering a promising direction for memory-centric AI accelerators.
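To make the IVF-PQ pipeline discussed above concrete, the following is a minimal, self-contained sketch of coarse probing, PQ-based approximate distance computation, and full-precision reranking (the stage whose data movement HAVEN targets). All names, sizes, and the use of sampled rather than trained centroids are illustrative assumptions, not HAVEN's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative sizes, not the paper's billion-scale config):
# N base vectors of dimension D, split into M subspaces for PQ.
N, D, M, KSUB, NLIST = 2000, 32, 4, 16, 8
DSUB = D // M
base = rng.standard_normal((N, D)).astype(np.float32)

# Coarse quantizer: NLIST centroids sampled from the data
# (a stand-in for trained k-means centroids).
coarse = base[rng.choice(N, NLIST, replace=False)]
assign = np.argmin(((base[:, None] - coarse[None]) ** 2).sum(-1), axis=1)
lists = [np.flatnonzero(assign == c) for c in range(NLIST)]

# PQ codebooks: KSUB centroids per subspace (again sampled, not trained).
codebooks = np.stack([
    base[rng.choice(N, KSUB, replace=False), m * DSUB:(m + 1) * DSUB]
    for m in range(M)
])
# Encode each base vector as M sub-centroid indices.
codes = np.stack([
    np.argmin(((base[:, m * DSUB:(m + 1) * DSUB][:, None]
                - codebooks[m][None]) ** 2).sum(-1), axis=1)
    for m in range(M)
], axis=1)  # shape (N, M)

def search(query, nprobe=4, ncand=64, k=10):
    # 1) Probe the nprobe inverted lists whose centroids are closest.
    probe = np.argsort(((coarse - query) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([lists[c] for c in probe])
    # 2) Approximate (asymmetric) distances via per-subspace lookup tables.
    tables = np.stack([
        ((codebooks[m] - query[m * DSUB:(m + 1) * DSUB]) ** 2).sum(-1)
        for m in range(M)
    ])  # shape (M, KSUB)
    approx = tables[np.arange(M)[None, :], codes[cand]].sum(-1)
    short = cand[np.argsort(approx)[:ncand]]
    # 3) Rerank the shortlist with full-precision vectors -- the step that
    #    forces off-GPU fetches when `base` lives in CPU DRAM or on SSD.
    exact = ((base[short] - query) ** 2).sum(-1)
    return short[np.argsort(exact)[:k]]

# Query near a known base vector; reranking should surface it.
q = base[123] + 0.01 * rng.standard_normal(D).astype(np.float32)
top = search(q)
```

In this sketch the shortlist size `ncand` controls the recall/cost trade-off: the PQ scan is cheap but lossy, and reranking restores accuracy at the price of `ncand` full-precision vector fetches per query, which is exactly the traffic HAVEN keeps on-package.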