Deploying Retrieval-Augmented Generation (RAG) on edge devices is in high demand, but is hindered by the latency of massive data movement and computation on traditional architectures. Compute-in-Memory (CiM) architectures address this bottleneck by performing vector search directly within their crossbar structures. However, CiM's adoption for RAG is limited by a fundamental ``representation gap'': high-precision, high-dimensional embeddings are incompatible with CiM's low-precision, low-dimensional array constraints. This gap is compounded by the diversity of CiM implementations (e.g., SRAM, ReRAM, FeFET), each with unique design parameters (e.g., 2-bit cells, 512$\times$512 arrays). Consequently, RAG data must be naively reshaped to fit each target implementation. Current data shaping methods handle dimension and precision disjointly, which degrades data fidelity. This not only negates the advantages of CiM for RAG but also confuses hardware designers, making it unclear whether a failure stems from the circuit design or from the degraded input data. As a result, CiM adoption remains limited. In this paper, we introduce CQ-CiM, a unified, hardware-aware data shaping framework that jointly learns Compression and Quantization to produce CiM-compatible low-bit embeddings for diverse CiM designs. To the best of our knowledge, this is the first work to shape data for comprehensive CiM usage in RAG.