Deploying Retrieval-Augmented Generation (RAG) on edge devices is in high demand but is hindered by the latency of massive data movement and computation on traditional architectures. Compute-in-Memory (CiM) architectures address this bottleneck by performing vector search directly within their crossbar structure. However, CiM's adoption for RAG is limited by a fundamental ``representation gap'': high-precision, high-dimensional embeddings are incompatible with CiM's low-precision, low-dimensional array constraints. This gap is compounded by the diversity of CiM implementations (e.g., SRAM, ReRAM, FeFET), each with unique design parameters (e.g., 2-bit cells, 512$\times$512 arrays). Consequently, RAG data must be naively reshaped to fit each target implementation. Current data shaping methods handle dimension and precision disjointly, which degrades data fidelity. This not only negates the advantages of CiM for RAG but also confuses hardware designers, making it unclear whether a failure stems from the circuit design or from the degraded input data. As a result, CiM adoption remains limited. In this paper, we introduce CQ-CiM, a unified, hardware-aware data shaping framework that jointly learns Compression and Quantization to produce CiM-compatible low-bit embeddings for diverse CiM designs. To the best of our knowledge, this is the first work to shape data for comprehensive CiM usage on RAG.
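To make the ``representation gap'' concrete, the following sketch shows the naive, disjoint shaping pipeline the abstract criticizes: a high-precision, high-dimensional embedding is first reduced in dimension and then separately quantized to fit a hypothetical CiM array. The array size (512$\times$512), cell precision (2-bit), the random projection, and the uniform quantizer are all illustrative assumptions, not the CQ-CiM method itself (which learns both steps jointly).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CiM constraints (illustrative; real designs vary):
ARRAY_DIM = 512   # a 512x512 crossbar bounds the vector dimension
CELL_BITS = 2     # 2-bit cells bound per-element precision

# A high-precision, high-dimensional embedding, as a typical encoder emits.
embedding = rng.standard_normal(768).astype(np.float32)

# Step 1 -- naive compression: a random linear projection (stand-in for
# any off-the-shelf dimension reducer) shrinks the vector to the array width.
proj = rng.standard_normal((768, ARRAY_DIM)).astype(np.float32) / np.sqrt(768)
compressed = embedding @ proj

# Step 2 -- naive quantization, applied disjointly from step 1: uniform
# scalar quantization maps each element to one of 2**CELL_BITS levels.
levels = 2 ** CELL_BITS
lo, hi = compressed.min(), compressed.max()
codes = np.round((compressed - lo) / (hi - lo) * (levels - 1)).astype(np.uint8)

# The result now fits the array constraints, but each stage discarded
# information without regard to the other -- the fidelity loss CQ-CiM
# targets by learning compression and quantization jointly.
assert codes.shape == (ARRAY_DIM,)
assert codes.max() <= levels - 1
```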