Deep-learning accelerators are increasingly in demand; however, their performance is constrained by the size of feature maps, which drives up bandwidth requirements and buffer sizes. We propose an adaptive-scale feature map compression technique that exploits the distinctive properties of feature maps. Because correlation across channels is weak, the technique indexes each channel independently; because local correlation is strong, it uses a cubical-like block shape. The method further improves compression with a switchable endpoint mode and adaptive-scale interpolation, which handle unimodal data distributions both with and without outliers. For 16-bit data, this yields compression ratios of 4$\times$ at constant bitrate and up to 7.69$\times$ at variable bitrate. Our hardware design minimizes area cost by adjusting the interpolation scales so that hardware can be shared among interpolation points. Additionally, we introduce a threshold concept that keeps interpolation straightforward and avoids the need for intricate hardware. A TSMC 28nm implementation has an equivalent gate count of 6135 for the 8-bit version. Furthermore, the hardware architecture scales effectively, with only a sublinear increase in area cost: a 32$\times$ increase in throughput, enough to match the theoretical bandwidth of DDR5-6400, requires just 7.65$\times$ the hardware cost.
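
To make the endpoint-plus-interpolation idea concrete, the following is a minimal Python sketch of generic fixed-rate block compression: each block stores two endpoints and a small per-value index, and values are reconstructed by linear interpolation between the endpoints. It is an illustration under stated assumptions, not the proposed method: the function names, the min/max endpoint choice, the 3-bit index width, and the 4$\times$4$\times$4 block are all hypothetical, and the switchable endpoint mode, adaptive scales, and threshold logic described above are not modeled.

```python
import numpy as np


def compress_block(block, index_bits=3):
    """Encode a block as (lo, hi, idx): two endpoints plus per-value indices.

    Assumption: endpoints are simply the block minimum and maximum.
    """
    lo, hi = int(block.min()), int(block.max())
    levels = (1 << index_bits) - 1  # highest index value
    if hi == lo:
        # Constant block: every value maps to the low endpoint.
        idx = np.zeros(block.shape, dtype=np.uint8)
    else:
        scaled = (block.astype(np.float64) - lo) * levels / (hi - lo)
        idx = np.rint(scaled).astype(np.uint8)
    return lo, hi, idx


def decompress_block(lo, hi, idx, index_bits=3):
    """Reconstruct values by linear interpolation between the endpoints."""
    levels = (1 << index_bits) - 1
    return np.rint(lo + idx.astype(np.float64) * (hi - lo) / levels).astype(np.int64)


def compression_ratio(n_values, data_bits=16, index_bits=3):
    """Raw bits divided by packed bits (two endpoints + one index per value)."""
    raw = n_values * data_bits
    packed = 2 * data_bits + n_values * index_bits
    return raw / packed


if __name__ == "__main__":
    # Hypothetical 4x4x4 block of 16-bit-range values.
    block = (np.arange(64) * 100).reshape(4, 4, 4)
    lo, hi, idx = compress_block(block)
    rec = decompress_block(lo, hi, idx)
    print("max abs error:", int(np.abs(rec - block).max()))
    print("ratio:", round(compression_ratio(block.size), 2))
```

With these assumed parameters, a 64-value block of 16-bit data packs into 2 x 16 + 64 x 3 = 224 bits instead of 1024, a fixed ratio of about 4.57. That figure describes this toy layout only and is not meant to reproduce the ratios reported above.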