Modern DNN workloads increasingly rely on activation functions built from computationally complex operations. This poses a challenge to current accelerators, which are optimized for convolutions and matrix-matrix multiplications. This work presents Flex-SFU, a lightweight hardware accelerator for activation functions that implements non-uniform piecewise interpolation and supports multiple data formats. Non-uniform segments and floating-point numbers are enabled by a binary-tree comparison within the address decoding unit. An SGD-based optimization algorithm with heuristics is proposed to find the interpolation function that reduces the mean squared error. Thanks to non-uniform interpolation and floating-point support, Flex-SFU achieves on average a 22.3x lower mean squared error than previous piecewise linear interpolation approaches. An evaluation with more than 700 computer vision and natural language processing models shows that Flex-SFU can, on average, improve the end-to-end performance of state-of-the-art AI hardware accelerators by 35.7%, achieving up to 3.3x speedup with negligible impact on the models' accuracy when using 32 segments, while introducing area and power overheads of only 5.9% and 0.8%, respectively, relative to the baseline vector processing unit.
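As an illustration of the idea, the following minimal Python sketch approximates GELU with non-uniform piecewise linear segments and locates the segment with a binary search, a software analogue of the binary-tree comparison in the address decoding unit. The names `build_table`, `pwl_eval`, and `BREAKPOINTS` are hypothetical, and the breakpoints are hand-picked for illustration; they are not produced by the SGD-based optimizer described above, and the sketch does not model the hardware data formats.

```python
import bisect
import math


def gelu(x):
    """Exact GELU, used here as the reference activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def build_table(fn, breakpoints):
    """Precompute one (slope, intercept) pair per segment.

    Each segment linearly interpolates fn between two adjacent breakpoints.
    """
    slopes, intercepts = [], []
    for x0, x1 in zip(breakpoints[:-1], breakpoints[1:]):
        y0, y1 = fn(x0), fn(x1)
        m = (y1 - y0) / (x1 - x0)
        slopes.append(m)
        intercepts.append(y0 - m * x0)
    return slopes, intercepts


def pwl_eval(x, breakpoints, slopes, intercepts):
    """Evaluate the piecewise linear approximation at x.

    The segment index is found by binary search over the sorted breakpoints.
    Inputs outside the table range reuse the first or last segment.
    """
    idx = bisect.bisect_right(breakpoints, x) - 1
    idx = min(max(idx, 0), len(slopes) - 1)
    return slopes[idx] * x + intercepts[idx]


# Assumption: hand-picked non-uniform breakpoints, denser near zero where
# GELU curves most. A real table would come from an optimizer.
BREAKPOINTS = [-6.0, -3.0, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0, 6.0]
SLOPES, INTERCEPTS = build_table(gelu, BREAKPOINTS)

if __name__ == "__main__":
    for x in (-2.5, -0.25, 0.75, 4.0):
        approx = pwl_eval(x, BREAKPOINTS, SLOPES, INTERCEPTS)
        print(f"x={x:+.2f}  gelu={gelu(x):+.5f}  pwl={approx:+.5f}")
```

Placing more breakpoints where the function curves most is what lets a small table keep the error low; a uniform grid with the same number of segments would spend most of them on the nearly linear tails.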