Dynamic Sparsity Is Channel-Level Sparsity Learner

Sparse training has attracted surging interest in machine learning because it promises substantial savings across the entire training process as well as at inference. Dynamic sparse training (DST), a leading sparse training approach, can train deep neural networks at high sparsity from scratch and match the performance of their dense counterparts. However, most, if not all, prior DST work demonstrates its effectiveness on unstructured sparsity with highly irregular sparse patterns, which receives limited support on common hardware. This limitation hinders the use of DST in practice. In this paper, we propose Channel-aware dynamic sparse (Chase), which for the first time seamlessly translates the promise of unstructured dynamic sparsity into GPU-friendly channel-level sparsity (not fine-grained N:M or group sparsity) within one end-to-end training process, without any ad-hoc operations. The resulting small sparse networks can be accelerated directly by commodity hardware, without any sparsity-aware hardware accelerators. This outcome is partially motivated by a hidden phenomenon of dynamic sparsity: off-the-shelf unstructured DST implicitly performs biased parameter reallocation across channels, leaving a large fraction of channels (up to 60%) sparser than the others. By progressively identifying and removing these channels during training, our approach translates unstructured sparsity into channel-wise sparsity. Our experimental results demonstrate that Chase achieves a 1.7× inference throughput speedup on common GPU devices without compromising accuracy with ResNet-50 on ImageNet. Our code is available at https://github.com/luuyin/chase.
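The idea of identifying and removing the sparsest channels can be illustrated with a small NumPy sketch. This is not the Chase implementation: the function names `channel_density` and `prune_sparsest_channels`, the toy layer shape, the per-channel keep probabilities, and the use of mask density as the ranking criterion are all assumptions made for illustration, and the criterion and schedule used in the paper may differ.

```python
import numpy as np

def channel_density(mask):
    """Fraction of active (nonzero) weights in each output channel."""
    out_channels = mask.shape[0]
    return mask.reshape(out_channels, -1).mean(axis=1)

def prune_sparsest_channels(weight, mask, fraction):
    """Drop the `fraction` of output channels with the lowest density.

    Hypothetical helper: ranks channels by mask density only.
    """
    density = channel_density(mask)
    n_remove = int(fraction * len(density))
    # argsort is ascending, so the first n_remove entries are the sparsest channels
    order = np.argsort(density, kind="stable")
    keep = np.sort(order[n_remove:])
    return weight[keep], mask[keep], keep

# Toy convolution layer: 8 output channels, 4 input channels, 3x3 kernels
rng = np.random.default_rng(0)
weight = rng.standard_normal((8, 4, 3, 3))

# Assumed per-channel keep probabilities that mimic an unbalanced
# unstructured mask, where some channels are much sparser than others
keep_prob = np.array([0.05, 0.6, 0.1, 0.7, 0.05, 0.8, 0.5, 0.1])
mask = (rng.random((8, 4, 3, 3)) < keep_prob[:, None, None, None]).astype(np.float32)

small_weight, small_mask, kept = prune_sparsest_channels(weight * mask, mask, fraction=0.5)
print(small_weight.shape)  # (4, 4, 3, 3)
```

The pruned layer is a smaller dense tensor, which is why channel-level sparsity can run on ordinary GPU kernels. In a full network, the matching input channels of the following layer would also have to be removed, which this sketch leaves out.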
