The constant growth of DNNs makes them challenging to implement and run efficiently on traditional compute-centric architectures. Some accelerators have attempted to solve the memory wall problem by adding more compute units and on-chip buffers, without much success and sometimes even worsening the issue, since more compute units also require higher memory bandwidth. Prior works have proposed memory-centric architectures based on the Near-Data Processing (NDP) paradigm. NDP seeks to break the memory wall by moving computation closer to the memory hierarchy, reducing data movement and its cost as much as possible. 3D-stacked memory is especially appealing for DNN accelerators due to its high-density, low-energy storage and its near-memory computation capabilities, which allow DNN operations to be performed massively in parallel. However, memory accesses remain the main bottleneck for running modern DNNs efficiently. To improve the efficiency of DNN inference, we present QeiHaN, a hardware accelerator that implements a 3D-stacked memory-centric weight storage scheme to take advantage of a logarithmic quantization of activations. In particular, since activations of FC and CONV layers of modern DNNs are commonly represented as powers of two with negative exponents, QeiHaN performs an implicit in-memory bit-shifting of the DNN weights to reduce memory activity: only the meaningful bits of the weights required for the bit-shift operation are accessed. Overall, QeiHaN reduces memory accesses by 25\% compared to a standard memory organization. We evaluate QeiHaN on a popular set of DNNs. On average, QeiHaN provides $4.3\times$ speedup and $3.5\times$ energy savings over a Neurocube-like accelerator.
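To make the bit-shifting idea concrete, the following is a minimal software sketch of multiplying a weight by a log-quantized activation. It is not the QeiHaN hardware design. The 8-bit unsigned weight width, the exponent clamp, the round-to-nearest-exponent rule, and all function names are assumptions chosen for illustration; the abstract does not specify them. The sketch shows only why an activation of the form $2^{-e}$ turns a multiplication into a right shift, so the $e$ low-order weight bits never need to be read.

```python
import math

WEIGHT_BITS = 8  # assumed weight width; not stated in the abstract


def log2_quantize(activation: float, max_exp: int = 7) -> int:
    """Return an exponent e >= 0 such that activation ~= 2**-e.

    Assumes 0 < activation <= 1 (hypothetical normalized range).
    """
    if not 0.0 < activation <= 1.0:
        raise ValueError("activation must be in (0, 1]")
    e = round(-math.log2(activation))
    return min(max(e, 0), max_exp)  # clamp to the assumed exponent range


def shift_multiply(weight: int, exp: int) -> int:
    """Approximate weight * 2**-exp with a truncating right shift.

    Assumes an unsigned weight in [0, 2**WEIGHT_BITS).
    """
    return weight >> exp


def bits_needed(exp: int) -> int:
    """High-order weight bits that survive the shift; the rest are discarded."""
    return max(WEIGHT_BITS - exp, 0)


def dot_product(weights, activations):
    """Shift-and-accumulate dot product; also counts the weight bits actually used."""
    acc, bits_read = 0, 0
    for w, a in zip(weights, activations):
        e = log2_quantize(a)
        acc += shift_multiply(w, e)
        bits_read += bits_needed(e)
    return acc, bits_read


if __name__ == "__main__":
    # Illustrative values only, not data from the paper.
    result, bits = dot_product([200, 64, 17], [0.25, 0.5, 0.03])
    print(result, bits, "of", 3 * WEIGHT_BITS, "weight bits")
```

In this toy model the saving depends entirely on the distribution of exponents, so it should not be read as reproducing the 25\% figure reported above.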