Deep neural networks increasingly employ low-precision quantization to reduce computational requirements. FPGAs are well suited to workloads with heterogeneous precisions, but their dedicated digital signal processing (DSP) slices offer only fixed-width datapaths, which low-bitwidth arithmetic significantly underutilizes. Previous approaches have already packed multiple values onto the same wide DSP datapath, but they either support only specific fixed bitwidths or spend excessive support logic outside the DSP. This paper proposes an efficient method to dynamically pack multiple signed or unsigned inputs of arbitrary bitwidth into a wide multiplier path by leveraging the DSP's internal pre-adder. Building on this, we present two distinct architectures, one optimized for matrix-vector multiplications and the other for convolutions. Our implementations are integrated into AMD's FINN framework. With these optimizations, we reduce LUT utilization by 21% and increase frames per second per DSP (FPS/DSP) by 36% for the UltraNet model compared to the FINN reference.
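To make the packing idea concrete, the following is a minimal software sketch of the general principle of sharing one wide multiplication between two narrow signed operands. It is not the architecture proposed in the paper: the function names, the choice of two weights and one shared activation, and the field width `shift` are all assumptions made for illustration. The addition in `pack_weights` is the kind of operation a DSP pre-adder could perform, but whether this matches the paper's scheme is likewise an assumption.

```python
def pack_weights(w_hi, w_lo, shift):
    """Combine two signed weights into one wide operand: w_hi * 2**shift + w_lo.

    Hypothetical helper: the addition absorbs the sign extension of w_lo.
    """
    return (w_hi << shift) + w_lo


def unpack_products(product, shift):
    """Split a packed product back into (w_hi * a, w_lo * a).

    Assumes -2**(shift - 1) <= w_lo * a < 2**(shift - 1), i.e. the low
    product fits into its field as a signed value.
    """
    mask = (1 << shift) - 1
    lo = product & mask
    # Reinterpret the low field as a signed number.
    if lo >= 1 << (shift - 1):
        lo -= 1 << shift
    # Removing the low product leaves an exact multiple of 2**shift.
    hi = (product - lo) >> shift
    return hi, lo


def packed_multiply(w_hi, w_lo, a, shift):
    """Compute w_hi * a and w_lo * a with a single wide multiplication."""
    return unpack_products(pack_weights(w_hi, w_lo, shift) * a, shift)


if __name__ == "__main__":
    # Two signed 4-bit weights share one signed 4-bit activation.
    print(packed_multiply(-3, 5, -7, shift=8))
```

With signed 4-bit operands the largest product magnitude is 64, so an 8-bit field per product is sufficient in this sketch. Narrower fields would make neighbouring products overlap and require additional correction, which the sketch does not attempt.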