Modern edge AI workloads demand maximum energy efficiency, motivating the pursuit of analog Compute-in-Memory (CIM) architectures. Simultaneously, the popularity of Large Language Models (LLMs) drives the adoption of low-bit floating-point formats that prioritize dynamic range. However, conventional direct-accumulation CIM accommodates floating-point operands by normalizing them to a shared, widened fixed-point scale. Consequently, hardware resolution is dictated by the input's dynamic range rather than its precision, and energy consumption is dominated by the ADC. We address this limitation by introducing local normalization for each input, weight, and multiply-accumulate (MAC) output via a Gain-Ranging MAC (GR-MAC). The normalization overhead is handled by low-power digital logic, enabling the computationally expensive MAC operation to remain in the energy-efficient low-precision analog regime. Energy modelling shows that adding a gain-ranging stage to the MAC enables a 4-bit increase in input dynamic range without increased energy consumption at a 35 dB SQNR target. Additionally, the ADC resolution requirement becomes invariant to input distribution assumptions, allowing construction of an upper bound that is 1.5 bits below the conventional lower bound. These results establish a pathway towards unlocking the favourable energy scaling trends of analog CIM for modern AI workloads.
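To illustrate the distinction between resolution spent on dynamic range and resolution spent on precision, the following minimal numerical sketch (not the paper's hardware or energy model) contrasts quantizing a wide-dynamic-range vector on one shared fixed-point scale, as in direct accumulation, with per-element normalization where only a low-precision mantissa is quantized and the exponent is carried digitally, the idea behind gain-ranging. All function names, parameters, and the chosen input distribution are illustrative assumptions.

```python
# Illustrative sketch only: behavioural comparison of shared-scale vs.
# locally normalized quantization; not the GR-MAC circuit or energy model.
import numpy as np

rng = np.random.default_rng(0)

def sqnr_db(ref, approx):
    """Signal-to-quantization-noise ratio in dB."""
    noise = ref - approx
    return 10 * np.log10(np.sum(ref**2) / np.sum(noise**2))

def quantize_shared_scale(v, bits):
    """Quantize every element on a single fixed-point scale spanning the
    vector's full dynamic range (resolution is spent on range)."""
    step = np.max(np.abs(v)) / (2 ** (bits - 1))
    return np.round(v / step) * step

def quantize_local_norm(v, mantissa_bits):
    """Normalize each element to a mantissa in [0.5, 1), quantize only the
    mantissa, and restore the per-element exponent digitally
    (resolution is spent on precision)."""
    mant, exp = np.frexp(v)                       # per-element normalization
    step = 1.0 / (2 ** (mantissa_bits - 1))
    return np.ldexp(np.round(mant / step) * step, exp)

# Wide-dynamic-range activations (heavy-tailed, signed) and Gaussian weights.
x = rng.lognormal(mean=0.0, sigma=2.0, size=4096) * rng.choice([-1.0, 1.0], 4096)
w = rng.normal(size=4096)
ref = x * w                                       # ideal products

shared = quantize_shared_scale(x, bits=8) * w
local = quantize_local_norm(x, mantissa_bits=5) * w

print(f"shared 8-bit scale  : {sqnr_db(ref, shared):5.1f} dB SQNR")
print(f"local 5-bit mantissa: {sqnr_db(ref, local):5.1f} dB SQNR")
```

On heavy-tailed inputs the shared scale wastes most of its codes on rarely occurring large magnitudes, while local normalization keeps the relative error roughly constant per element; this is the behavioural effect the GR-MAC exploits to keep the analog MAC and ADC at low precision while the digital gain-ranging logic absorbs the dynamic range.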