Large Language Models (LLMs) deployed on edge devices learn by fine-tuning and updating a portion of their parameters. Although such learning methods can be optimized to reduce resource utilization, the resources they require remain a heavy burden on edge devices. Retrieval-Augmented Generation (RAG), by contrast, is a resource-efficient LLM learning method that can improve the quality of LLM-generated content without updating model parameters. However, a RAG-based LLM may repeatedly search the user's profile data in every user-LLM interaction, and the latency of this search can grow significantly as user data accumulates. Conventional efforts to decrease latency restrict the size of saved user data, which reduces the scalability of RAG as user data continuously grows. An open question remains: how can RAG be freed from the constraints of latency and scalability on edge devices? In this paper, we propose a novel framework to accelerate RAG via Computing-in-Memory (CiM) architectures. CiM accelerates matrix multiplications by performing in-situ computation inside the memory, avoiding the expensive data transfer between the computing unit and memory. Our framework, Robust CiM-backed RAG (RoCR), combines a novel contrastive learning-based training method with noise-aware training to enable RAG to search profile data efficiently with CiM. To the best of our knowledge, this is the first work utilizing CiM to accelerate RAG.
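To make the link between RAG search and matrix multiplication concrete, the following is a minimal sketch, not the RoCR implementation. It assumes that profile entries are stored as unit-length embedding vectors and that retrieval ranks them by dot product with the query, which amounts to one matrix-vector product of the kind a CiM array could compute in place. The function names, the toy 4-dimensional embeddings, and the additive Gaussian noise model standing in for device variation are all illustrative assumptions.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def noisy_copy(matrix, sigma, rng):
    """Add Gaussian noise to every stored value (an assumed model of device variation)."""
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in matrix]


def retrieve(query, stored, top_k=1):
    """Return the indices of the top_k stored rows, ranked by dot product with the query.

    Computing all scores is a single matrix-vector product.
    """
    scores = [sum(q * s for q, s in zip(query, row)) for row in stored]
    ranked = sorted(range(len(stored)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Hypothetical embeddings of three profile entries (illustrative values only).
profile = [normalize(v) for v in ([1.0, 0.1, 0.0, 0.0],
                                  [0.0, 1.0, 0.2, 0.0],
                                  [0.0, 0.0, 0.1, 1.0])]
query = normalize([0.1, 0.9, 0.3, 0.0])

rng = random.Random(0)  # fixed seed so the sketch is reproducible
clean_hit = retrieve(query, profile)
noisy_hit = retrieve(query, noisy_copy(profile, sigma=0.05, rng=rng))
```

In this toy setting the best match is separated from the other entries by a wide similarity margin, so the small perturbation does not change which entry is retrieved. When stored embeddings are close together, the same noise could reorder the ranking, which is the kind of failure that noise-aware training is meant to guard against.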