The increasing computational demand of AI workloads has intensified the need for energy-efficient in-memory and near-memory computing architectures, particularly because data movement often consumes significantly more energy than computation itself. While fully digital architectures provide robust scalability and support higher-resolution computation, analog in-memory computing has demonstrated improved energy efficiency for low-precision workloads. However, its reliance on peripheral DACs and ADCs introduces additional power, area, and design overhead. To address these challenges, this work presents a time-domain near-memory computing architecture for low-precision multiply-and-accumulate (MAC) operations. In the proposed approach, digital weight bits stored in SRAM are converted using a current-steering DAC, while the digital input vector is encoded by an N-pulse generator. This enables multiplication to be performed in the time domain while maintaining a digital-friendly interface. Two accumulation schemes, a delay-cell-based architecture and a counter-based architecture, are investigated and compared in terms of design trade-offs, linearity, scalability, and power efficiency. To improve technology portability, the N-pulse generator and counters are implemented using RTL synthesis, while the current-steering DAC remains in the analog domain. A 4 × 4 MAC prototype is implemented with a 1 V supply, achieving an operating frequency of 40 MHz, power consumption of 42 μW, and energy efficiency of 7.62 TOPS/W.
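The counter-based accumulation described in the abstract can be illustrated with a small behavioral sketch. It is not the authors' circuit: the abstract does not give the circuit-level mapping, so the sketch assumes that each input value is encoded as that many pulses, that the weight code (via the DAC current) sets a pulse width proportional to the weight, and that a counter clocked by a reference period quantizes the accumulated time. The names `time_domain_mac`, `t_unit_ps`, and `t_clk_ps` are hypothetical, and the model ignores nonlinearity, mismatch, and noise.

```python
from typing import Sequence


def time_domain_mac(
    weights: Sequence[int],
    inputs: Sequence[int],
    t_unit_ps: int = 100,  # assumed pulse width per weight LSB, in picoseconds
    t_clk_ps: int = 100,   # assumed reference clock period of the counter
) -> int:
    """Idealized model of a time-domain MAC with counter-based accumulation.

    Assumption (not stated in the source): input x_i produces x_i pulses,
    each of width w_i * t_unit_ps, so element i contributes
    x_i * w_i * t_unit_ps of total pulse time.
    """
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    if any(v < 0 for v in list(weights) + list(inputs)):
        raise ValueError("this sketch models unsigned codes only")

    total_time_ps = 0
    for w, x in zip(weights, inputs):
        pulse_width_ps = w * t_unit_ps       # weight sets the pulse width
        total_time_ps += x * pulse_width_ps  # N-pulse encoding repeats it x times

    # The counter reports how many full reference periods fit in the total time.
    return total_time_ps // t_clk_ps


if __name__ == "__main__":
    w = [1, 2, 3, 4]
    x = [4, 3, 2, 1]
    print(time_domain_mac(w, x))                # matches the ideal dot product
    print(time_domain_mac(w, x, t_clk_ps=300))  # coarser counter quantizes the result
```

When `t_clk_ps` equals `t_unit_ps`, the count equals the ideal dot product; a slower reference clock trades resolution for fewer counter toggles, which is one way to picture the linearity and power trade-offs the abstract mentions.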