Structured sparsity has been proposed as an efficient way to prune the complexity of modern Machine Learning (ML) applications and to simplify the handling of sparse data in hardware. The acceleration of ML models, for both training and inference, relies primarily on equivalent matrix multiplications that can be executed efficiently on vector processors or custom matrix engines. The goal of this work is to bring the simplicity of structured sparsity to vector execution, thereby accelerating the corresponding matrix multiplications. To this end, a new vector index-multiply-accumulate instruction is proposed, which enables low-cost indirect reads from the vector register file. This reduces unnecessary memory traffic and increases data locality. The proposed instruction was integrated into a decoupled RISC-V vector processor with negligible hardware cost. Extensive evaluation demonstrates speedups of 1.80x-2.14x over state-of-the-art vectorized kernels when executing layers of varying sparsity from state-of-the-art Convolutional Neural Networks (CNNs).
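To make the idea concrete, the following is a minimal Python sketch of a structured-sparse matrix multiplication driven by per-block indices. It is not the proposed instruction or its exact semantics, which the abstract does not specify. It assumes an n:m pattern (at most n nonzeros in every block of m consecutive columns of A) and assumes the inner dimension is a multiple of m. The names `compress_row` and `sparse_matmul` are hypothetical. The list `regs` stands in for rows of B held in vector registers, and the lookup `regs[index]` models the indirect register read that precedes the multiply-accumulate.

```python
def compress_row(row, n, m):
    """Compress one row that has at most n nonzeros per block of m columns.

    Returns (values, indices); each index is local to its block (0..m-1).
    """
    values, indices = [], []
    for start in range(0, len(row), m):
        block = row[start:start + m]
        kept = [(v, j) for j, v in enumerate(block) if v != 0]
        if len(kept) > n:
            raise ValueError("row violates the assumed n:m sparsity pattern")
        # Pad with explicit zeros so every block stores exactly n entries.
        kept += [(0, 0)] * (n - len(kept))
        values.extend(v for v, _ in kept)
        indices.extend(j for _, j in kept)
    return values, indices


def sparse_matmul(a_rows, b, n, m):
    """Compute C = A @ B, touching only the rows of B selected by the indices."""
    cols = len(b[0])
    c = []
    for row in a_rows:
        if len(row) % m != 0:
            raise ValueError("inner dimension must be a multiple of m")
        values, indices = compress_row(row, n, m)
        acc = [0] * cols
        for blk in range(len(row) // m):
            # Assumption: the m rows of B for this block are already loaded
            # into "vector registers", so no further memory access is needed.
            regs = b[blk * m:(blk + 1) * m]
            for t in range(n):
                scalar = values[blk * n + t]
                src = regs[indices[blk * n + t]]  # indirect register read
                acc = [x + scalar * y for x, y in zip(acc, src)]  # multiply-accumulate
        c.append(acc)
    return c


if __name__ == "__main__":
    a = [[1, 0, 0, 2],
         [0, 3, 4, 0]]          # 1:2 structured sparsity
    b = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(sparse_matmul(a, b, n=1, m=2))
```

In this sketch the rows of B for a block are loaded once and reused for every nonzero in that block, which is the kind of data locality the abstract attributes to indirect reads from the vector register file.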