Deep Learning (DL) has achieved unprecedented success in various application domains. Meanwhile, model pruning has emerged as a viable solution to reduce the footprint of DL models in mobile applications without compromising their accuracy. To enable the matrix engines built for dense DL models to also handle their pruned counterparts, pruned DL models follow a fine-grained structured sparsity pattern of 1:4 or 2:4, whereby in each group of four contiguous values at most one or two, respectively, may be non-zero. Structured sparsity has recently also moved to coarser (relaxed) cases of N:128 or N:256, for small values of N, targeting a wider range of sparsity (10%-90%) for the DL models. In this work, we design an accelerator that operates, by construction, on wide blocks with relaxed structured sparsity. In contrast to the conventional systolic array archetype, the new engine decouples the memory part of the systolic array from the multiply-add units. The memory block comprises 1 write port and N read ports, with the number of read ports being equal to the number of non-zero elements per row. The multiply-add units connect directly to each read port and complete the multiplication in a row-wise product-first order. More importantly, simple reconfiguration allows the same engine to handle denser patterns. The experimental evaluation demonstrates substantial latency improvements over current state-of-the-art systolic array engines built for fine-grained and relaxed structured sparsity.
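The N:M constraint and the row-wise product order described above can be illustrated in software. The following is a minimal sketch in plain Python, not the accelerator's implementation: the function names `satisfies_n_of_m` and `rowwise_sparse_matmul` are hypothetical, and the sketch assumes that the sparse operand A selects rows of a dense operand B, so that each non-zero in a row of A corresponds to one read of B from the memory block.

```python
def satisfies_n_of_m(row, n, m):
    """Return True if every aligned group of m contiguous values in `row`
    holds at most n non-zero entries (the N:M structured sparsity rule)."""
    return all(
        sum(1 for v in row[start:start + m] if v != 0) <= n
        for start in range(0, len(row), m)
    )


def rowwise_sparse_matmul(a, b):
    """Compute C = A x B one output row at a time (row-wise product).

    Assumption for illustration: every non-zero A[i][k] triggers one read of
    row k of B (one "read port" in the abstract's terms) and scales it into
    the output row. Zero entries of A cost neither a read nor a multiply.
    """
    cols = len(b[0])
    c = []
    for a_row in a:
        out = [0] * cols
        for k, a_val in enumerate(a_row):
            if a_val == 0:
                continue  # pruned weight: no read, no multiply
            b_row = b[k]  # fetch the row of B selected by this non-zero
            for j in range(cols):
                out[j] += a_val * b_row[j]
        c.append(out)
    return c


# Small example: A has at most 2 non-zeros per group of 4 (a 2:4 pattern).
A = [[0, 2, 0, 0],
     [1, 0, 0, 3]]
B = [[1, 2],
     [3, 4],
     [5, 6],
     [7, 8]]
C = rowwise_sparse_matmul(A, B)
```

With a relaxed pattern such as N:128, the inner loop over `a_row` would visit at most N non-zeros per 128-wide block, which is why, under this reading, N read ports suffice to keep the multiply-add units busy.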