At-Scale Evaluation of Weight Clustering to Enable Energy-Efficient Object Detection

Accelerators that implement Deep Neural Networks for image-based object detection must fetch large volumes of data, both images and neural network parameters, especially when they process video streams. Fetching all that data leads to high power dissipation and high bandwidth requirements. While some solutions exist to mitigate the power and bandwidth demands of data fetching, they are often assessed through limited evaluations at a scale much smaller than that of the target application, which makes it hard to find the best tradeoff in practice. This paper sets up the infrastructure to assess at scale a key power and bandwidth optimization, weight clustering, for You Only Look Once v3 (YOLOv3), a neural network-based object detection system, using videos of real driving conditions. Our assessment shows that accelerators such as systolic arrays with an Output Stationary architecture are highly effective when combined with weight clustering. In particular, applying weight clustering independently per neural network layer, with between 32 (5-bit) and 256 (8-bit) distinct weight values, achieves an accuracy close to that of the original YOLOv3 weights (32-bit weights). This reduction in weight bit count shaves bandwidth requirements down to 30%-40% of the original requirements and reduces energy consumption down to 45%. These gains rest on two facts: (i) the energy spent on multiply-and-accumulate operations is much smaller than the energy spent fetching data from DRAM, and (ii) an appropriately designed accelerator can ensure that most of the data fetched corresponds to neural network weights, where clustering can be applied. Overall, our at-scale assessment provides key results for architecting camera-based object detection accelerators by combining a real-life application (YOLOv3) and real driving videos in a unified setup, so that the trends observed are reliable.
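
To make the per-layer clustering idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes 1-D k-means (Lloyd's algorithm) as the clustering method, which the abstract does not specify. The names `cluster_layer`, `index_bits`, and `weight_traffic_ratio` are hypothetical. Each layer keeps a small codebook of 32-bit values, and every weight is stored as a short index into that codebook.

```python
import random


def cluster_layer(weights, n_clusters, iters=20, seed=0):
    """Cluster one layer's weights with 1-D k-means (an assumed method).

    Returns (codebook, indices): codebook[indices[i]] approximates weights[i].
    """
    rng = random.Random(seed)
    distinct = sorted(set(weights))
    k = min(n_clusters, len(distinct))
    # Initialise the centroids from distinct weight values of this layer.
    codebook = rng.sample(distinct, k)

    def assign(book):
        # Index of the nearest centroid for every weight.
        return [min(range(k), key=lambda c: abs(w - book[c])) for w in weights]

    for _ in range(iters):
        indices = assign(codebook)
        sums, counts = [0.0] * k, [0] * k
        for w, i in zip(weights, indices):
            sums[i] += w
            counts[i] += 1
        # Move each centroid to the mean of its members; keep empty ones as-is.
        codebook = [sums[c] / counts[c] if counts[c] else codebook[c]
                    for c in range(k)]
    return codebook, assign(codebook)


def index_bits(n_clusters):
    """Bits needed to address n_clusters codebook entries (32 -> 5, 256 -> 8)."""
    return max(1, (n_clusters - 1).bit_length())


def weight_traffic_ratio(n_weights, n_clusters, full_bits=32):
    """Clustered weight traffic (indices plus codebook) relative to full-width weights."""
    clustered = n_weights * index_bits(n_clusters) + n_clusters * full_bits
    return clustered / (n_weights * full_bits)
```

Under these assumptions, a layer with many weights and 32 clusters fetches roughly 5/32 of its original weight bits, and with 256 clusters roughly 8/32. These ratios cover weight traffic only. The 30%-40% figure in the abstract refers to overall bandwidth, which also includes data that clustering does not shrink.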
