We introduce a new algorithm, the Free-pipeline Fast Inner Product (FFIP), and its hardware architecture, which improve on an under-explored fast inner-product algorithm (FIP) proposed by Winograd in 1968. Unlike the unrelated Winograd minimal filtering algorithms for convolutional layers, FIP is applicable to all machine learning (ML) model layers that decompose mainly into matrix multiplication, including fully-connected, convolutional, recurrent, and attention/transformer layers. We implement FIP for the first time in an ML accelerator, then present our FFIP algorithm and generalized architecture, which inherently improve FIP's clock frequency and, as a consequence, its throughput for a similar hardware cost. Finally, we contribute ML-specific optimizations for the FIP and FFIP algorithms and architectures. We show that FFIP can be seamlessly incorporated into traditional fixed-point systolic array ML accelerators to achieve the same throughput with half the number of multiply-accumulate (MAC) units, or to double the maximum systolic array size that fits onto devices with a fixed hardware budget. Our FFIP implementation for non-sparse ML models with 8- to 16-bit fixed-point inputs achieves higher throughput and compute efficiency than the best-in-class prior solutions on the same type of compute platform.
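To make the "half the number of multiplications" idea concrete, the following is a minimal software sketch of the baseline FIP identity from Winograd (1968) applied to matrix multiplication. It is not the FFIP algorithm or the hardware architecture described above, whose details are not given here. The function names `fip_matmul` and `mult_counts` are hypothetical, the inner dimension is assumed to be even, and plain Python integers stand in for fixed-point values. The identity used is (a0 + b1)(a1 + b0) - a0·a1 - b0·b1 = a0·b0 + a1·b1, where the two subtracted terms depend on only one operand and can therefore be reused across a whole row or column of outputs.

```python
def fip_matmul(A, B):
    """Multiply A (m x k) by B (k x n) with Winograd's 1968 inner-product identity.

    Assumes k is even; an odd k could be handled by zero-padding (not shown).
    """
    m, k, n = len(A), len(B), len(B[0])
    if k % 2 != 0:
        raise ValueError("inner dimension must be even")
    half = k // 2

    # Correction terms: one per row of A and one per column of B.
    # Each is computed once and reused for every output that shares it.
    row_corr = [sum(A[i][2 * p] * A[i][2 * p + 1] for p in range(half))
                for i in range(m)]
    col_corr = [sum(B[2 * p][j] * B[2 * p + 1][j] for p in range(half))
                for j in range(n)]

    C = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(half):
                # One multiplication covers two terms of the inner product.
                acc += (A[i][2 * p] + B[2 * p + 1][j]) * (A[i][2 * p + 1] + B[2 * p][j])
            C[i][j] = acc - row_corr[i] - col_corr[j]
    return C


def mult_counts(m, k, n):
    """Return (conventional, FIP) multiplication counts for an m x k by k x n product."""
    conventional = m * n * k
    fip = m * n * (k // 2) + (m + n) * (k // 2)
    return conventional, fip


A = [[1, 2, 3, 4], [5, 6, 7, 8]]
B = [[1, 0], [0, 1], [2, 3], [4, 5]]
print(fip_matmul(A, B))         # [[23, 31], [51, 67]]
print(mult_counts(64, 64, 64))  # (262144, 135168)
```

In this sketch the multiplication count approaches half of the conventional count as the matrix dimensions grow, at the cost of extra additions, which is the trade-off that makes the approach attractive for fixed-point MAC arrays.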