Interconnect power consumption remains a bottleneck in Deep Neural Network (DNN) accelerators. While ordering data by their '1'-bit (population) counts can mitigate this through reduced switching activity, practical hardware sorting implementations remain underexplored. This work proposes a hardware implementation of a comparison-free sorting unit optimized for Convolutional Neural Networks (CNNs). By leveraging approximate computing to group population counts into coarse-grained buckets, our design reduces hardware area while preserving the link-power benefits of data reordering. The approximate sorting unit achieves up to a 35.4% area reduction while maintaining a 19.50% bit-transition (BT) reduction, compared to 20.42% for the precise implementation.
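The idea of comparison-free, popcount-bucketed sorting can be sketched in software. The code below is a minimal behavioral model, not the paper's hardware design: it assumes 8-bit data words, and the `bucket_shift` parameter is a hypothetical stand-in for the coarse-grained grouping (right-shifting the popcount merges adjacent counts into one bucket, which is the approximation that shrinks the bucket count and hence the hardware).

```python
def popcount_bucket_sort(words, bits=8, bucket_shift=1):
    """Comparison-free sort: place each word into a bucket indexed by a
    coarsened version of its '1'-bit count, then concatenate buckets.
    bucket_shift=0 models the precise sorter (one bucket per count);
    larger shifts model coarser, cheaper approximate grouping."""
    n_buckets = (bits >> bucket_shift) + 1
    buckets = [[] for _ in range(n_buckets)]
    for w in words:
        # bin(w).count("1") is the population count of the word.
        buckets[bin(w).count("1") >> bucket_shift].append(w)
    return [w for bucket in buckets for w in bucket]


def link_bit_transitions(seq):
    """Bit transitions (BT) on a link: total Hamming distance between
    consecutive words, a proxy for interconnect switching activity."""
    return sum(bin(a ^ b).count("1") for a, b in zip(seq, seq[1:]))
```

Sending words in popcount order tends to make consecutive words differ in fewer bit positions, which is the source of the BT reduction; the coarse buckets trade a small amount of that reduction for a cheaper indexing circuit.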