An evolving solution to address hallucination and enhance accuracy in large language models (LLMs) is Retrieval-Augmented Generation (RAG), which involves augmenting LLMs with information retrieved from an external knowledge source, such as the web. This paper profiles several RAG execution pipelines and demystifies the complex interplay between their retrieval and generation phases. We demonstrate that while exact retrieval schemes are expensive, they can reduce inference time compared to approximate retrieval variants because an exact retrieval model can send a smaller but more accurate list of documents to the generative model while maintaining the same end-to-end accuracy. This observation motivates the acceleration of the exact nearest neighbor search for RAG. In this work, we design Intelligent Knowledge Store (IKS), a type-2 CXL device that implements a scale-out near-memory acceleration architecture with a novel cache-coherent interface between the host CPU and near-memory accelerators. IKS offers 13.4-27.9x faster exact nearest neighbor search over a 512GB vector database compared with executing the search on Intel Sapphire Rapids CPUs. This higher search performance translates to 1.7-26.3x lower end-to-end inference time for representative RAG applications. IKS is inherently a memory expander; its internal DRAM can be disaggregated and used for other applications running on the server to prevent DRAM, which is the most expensive component in today's servers, from being stranded.
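The kernel the abstract targets, exact (brute-force) nearest-neighbor search over an embedding store, can be sketched in a few lines. This is an illustrative NumPy sketch of the operation IKS accelerates, not the paper's implementation; the function name, corpus size, and inner-product similarity metric are assumptions for the example.

```python
import numpy as np

def exact_knn(queries: np.ndarray, corpus: np.ndarray, k: int) -> np.ndarray:
    """Brute-force exact top-k search: score every corpus vector against
    every query (no index, no approximation), using inner-product
    similarity as is common for RAG embedding retrieval."""
    scores = queries @ corpus.T                      # (n_queries, n_docs)
    # argpartition isolates the k best scores in O(n_docs) per query,
    # then we sort only those k to get a ranked result list.
    topk = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    order = np.argsort(-np.take_along_axis(scores, topk, axis=1), axis=1)
    return np.take_along_axis(topk, order, axis=1)

# Toy corpus; a production store like the 512GB database in the paper
# would hold billions of such vectors, hence the need for acceleration.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 64)).astype(np.float32)
queries = corpus[:3] + 0.01 * rng.standard_normal((3, 64)).astype(np.float32)
neighbors = exact_knn(queries, corpus, k=5)
```

Because every corpus vector is scored, the result list is exact, which is what lets the retriever pass fewer, more accurate documents to the generative model at the same end-to-end accuracy.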