Neural networks achieve state-of-the-art performance in image classification, speech recognition, scientific analysis, and many other application areas. With the ever-increasing need for faster computation and lower power consumption, driven by real-time systems and Internet-of-Things (IoT) devices, FPGAs have emerged as suitable devices for deep learning inference. Because neural networks have high computational complexity and a large memory footprint, various compression techniques, such as pruning, quantization, and knowledge distillation, have been proposed in the literature. Pruning sparsifies a neural network, reducing both the number of multiplications and the memory required. However, pruning often fails to capture properties of the underlying hardware, resulting in unstructured sparsity and load imbalance that limit the achievable resource savings. We propose a hardware-centric approach that formulates pruning as a knapsack problem with resource-aware tensor structures. The primary emphasis is on real-time inference, with latencies on the order of 1 $\mu$s, accelerated with hls4ml, an open-source framework for deep learning inference on FPGAs. Evaluated on a range of tasks, including real-time particle classification at CERN's Large Hadron Collider and fast image classification, the proposed method reduces the utilization of digital signal processing (DSP) blocks by 55% to 92% and block memory (BRAM) utilization by up to 81%.
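To make the knapsack view of pruning concrete, the following is a minimal sketch and not the method's actual implementation. It assumes that weights are grouped into consecutive blocks of fixed size (a stand-in for a resource-aware tensor structure), that each block's value is its L1 norm, that each block has an integer resource cost (for example, one DSP), and that the selection is solved with a textbook 0/1 knapsack dynamic program. The names `group_values`, `knapsack_select`, and `prune` are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence


def group_values(weights: Sequence[float], group_size: int) -> List[float]:
    """Score each consecutive group of weights by its L1 norm.

    The L1 norm is an assumed importance measure, used here for illustration.
    """
    return [
        sum(abs(w) for w in weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


def knapsack_select(values: Sequence[float], costs: Sequence[int], budget: int) -> List[int]:
    """Solve a 0/1 knapsack by dynamic programming.

    Returns the sorted indices of the groups to keep so that the total value
    is maximal and the total cost does not exceed the budget.
    """
    n = len(values)
    # best[i][b] = maximum value achievable with the first i groups and budget b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v, c = values[i - 1], costs[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: prune group i-1
            if c <= b and best[i - 1][b - c] + v > best[i][b]:
                best[i][b] = best[i - 1][b - c] + v  # option 2: keep group i-1
    # Walk back through the table to recover which groups were kept.
    keep = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            keep.append(i - 1)
            b -= costs[i - 1]
    return sorted(keep)


def prune(weights: Sequence[float], group_size: int,
          costs: Sequence[int], budget: int) -> List[float]:
    """Zero out every weight whose group was not selected by the knapsack."""
    values = group_values(weights, group_size)
    keep = set(knapsack_select(values, costs, budget))
    return [w if (i // group_size) in keep else 0.0 for i, w in enumerate(weights)]


if __name__ == "__main__":
    # Eight weights in groups of two; each group is assumed to cost one resource unit.
    w = [0.9, -0.8, 0.05, 0.01, 0.4, 0.3, -0.02, 0.03]
    print(prune(w, group_size=2, costs=[1, 1, 1, 1], budget=2))
```

With a budget of two resource units, the sketch keeps the two groups with the largest L1 norm and zeroes the rest, so whole groups are removed together instead of scattered individual weights. Unequal costs per group are also handled by the same dynamic program.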