To address the growing computational demands of artificial intelligence (AI) and big data, compute-in-memory (CIM) integrates memory and processing units in the same physical location, reducing the system's time and energy overhead. Despite advances in non-volatile memory (NVM) for matrix multiplication, other critical data-intensive operations, such as parallel search, have been overlooked. Current parallel search architectures, namely content-addressable memories (CAMs), are often binary, which restricts density and functionality. We present an analog CAM (ACAM) cell, built on two complementary ferroelectric field-effect transistors (FeFETs), that performs parallel search in the analog domain with over 40 distinct match windows. We then use it to compute the similarity between vectors, a building block of two machine learning problems. First, applied to similarity search for few-shot learning on the Omniglot dataset, ACAM outperforms ternary CAM (TCAM): projected simulation results show a 5% improvement in inference accuracy, a 3x denser memory architecture, and more than 100x faster speed per similarity search than a central processing unit (CPU) or graphics processing unit (GPU) on scaled CMOS nodes. Second, we demonstrate 1-step inference on a kernel regression model by combining non-linear kernel computation and matrix multiplication in ACAM, with simulation estimates indicating 1,000x faster inference than CPU and GPU.
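To make the idea of window-based parallel search concrete, the following is a minimal software sketch, not the device or circuit described above. It assumes each ACAM cell can be modelled as a stored interval (a "match window") and that similarity between a query and a stored row is scored by counting mismatching cells. The names `cell_matches`, `row_mismatches`, `nearest_row`, and `make_row`, as well as the fixed-half-width encoding, are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Assumption: one analog CAM cell is modelled as a (low, high) match window.
Window = Tuple[float, float]


def cell_matches(window: Window, value: float) -> bool:
    """A cell matches when the analog input falls inside its stored window."""
    low, high = window
    return low <= value <= high


def row_mismatches(row: List[Window], query: List[float]) -> int:
    """Count the cells in one stored row whose window rejects the query value."""
    if len(row) != len(query):
        raise ValueError("row and query must have the same length")
    return sum(not cell_matches(w, q) for w, q in zip(row, query))


def nearest_row(memory: List[List[Window]], query: List[float]) -> int:
    """Return the index of the stored row with the fewest mismatching cells.

    A CAM compares all rows against the query at once; this loop only
    emulates the outcome of that parallel search sequentially.
    """
    scores = [row_mismatches(row, query) for row in memory]
    return min(range(len(scores)), key=scores.__getitem__)


def make_row(vector: List[float], half_width: float) -> List[Window]:
    """Hypothetical encoding: centre a fixed-width window on each stored value."""
    return [(v - half_width, v + half_width) for v in vector]


# Two stored vectors and one query (illustrative values only).
memory = [
    make_row([0.10, 0.80, 0.50], half_width=0.05),
    make_row([0.90, 0.20, 0.40], half_width=0.05),
]
query = [0.12, 0.78, 0.70]
best = nearest_row(memory, query)
```

In this toy setting the query falls inside two of the three windows of the first row and none of the second, so the first row is returned. A binary or ternary CAM would need several cells to represent each of these intervals, which is one way to read the density argument in the abstract.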
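For reference, kernel regression inference conventionally takes two steps in software: evaluate a non-linear kernel between the query and each stored point, then take a weighted sum of those kernel values. The sketch below shows that two-step baseline, which is the computation that 1-step inference in ACAM would merge into a single operation. The Gaussian kernel, the parameter `gamma`, and the function names are assumptions made for illustration; the abstract does not specify which kernel is used.

```python
import math
from typing import List


def rbf_kernel(x: List[float], center: List[float], gamma: float) -> float:
    """Gaussian kernel exp(-gamma * ||x - center||^2) (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-gamma * sq_dist)


def kernel_regression_predict(
    x: List[float],
    centers: List[List[float]],
    weights: List[float],
    gamma: float = 1.0,
) -> float:
    """Two-step software baseline: kernel evaluation, then a weighted sum."""
    if len(centers) != len(weights):
        raise ValueError("need exactly one weight per stored point")
    # Step 1: non-linear kernel between the query and every stored point.
    kernel_values = [rbf_kernel(x, c, gamma) for c in centers]
    # Step 2: dot product of the kernel values with the learned weights.
    return sum(w * k for w, k in zip(weights, kernel_values))


# Illustrative values only: two stored points with hand-picked weights.
centers = [[0.0, 0.0], [1.0, 1.0]]
weights = [2.0, -1.0]
prediction = kernel_regression_predict([0.0, 0.0], centers, weights, gamma=1.0)
```

Here the query coincides with the first stored point, so its kernel value is 1 and the prediction equals `2 - exp(-2)`.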