Several challenges make it difficult for sparse neural networks to compete with dense models. First, setting a large fraction of weights to zero impairs forward and gradient signal propagation. Second, sparsity studies often need to test multiple sparsity levels while also introducing new hyperparameters (HPs), leading to prohibitive tuning costs. Indeed, the standard practice is to re-use the learning HPs originally crafted for dense models. Unfortunately, we show that sparse and dense networks do not share the same optimal HPs. Without stable dynamics and effective training recipes, it is costly to test sparsity at scale, which is key to surpassing dense networks and making the business case for sparsity acceleration in hardware. A holistic approach is needed to tackle these challenges, and we propose S$\mu$Par as one such approach. S$\mu$Par ensures that activations, gradients, and weight updates all scale independently of sparsity level. Further, by reparameterizing the HPs, S$\mu$Par enables the same HP values to be optimal as we vary both sparsity level and model width. HPs can therefore be tuned on small dense networks and transferred to large sparse models, greatly reducing tuning costs. On large-scale language modeling, S$\mu$Par training improves loss by up to 8.2% over the common approach of using the dense-model standard parameterization.
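To make the first challenge concrete, the following minimal sketch measures how the output scale of a single linear layer changes when a random fraction of its weights is zeroed. It is an illustration under stated assumptions, not the S$\mu$Par recipe itself: it assumes random unstructured sparsity and a plain fan-in initialization, and it covers only forward activations, not gradients, weight updates, or HPs. The function `preactivation_std` and its `rescale_init` switch, which divides the initialization variance by the density, are hypothetical names introduced here.

```python
import math
import random


def preactivation_std(width, density, rescale_init, batch=32, seed=0):
    """Empirical std of the outputs of one randomly masked linear layer.

    `rescale_init` is a hypothetical switch for this sketch: when True, the
    init variance is divided by `density` to offset the masked-out weights.
    """
    rng = random.Random(seed)
    # Fan-in init: variance 1/width gives unit-variance outputs when dense.
    var = 1.0 / width
    if rescale_init:
        var /= density  # assumed correction, for illustration only
    sigma = math.sqrt(var)
    # Each weight survives with probability `density` (random unstructured mask).
    weights = [
        [rng.gauss(0.0, sigma) if rng.random() < density else 0.0
         for _ in range(width)]
        for _ in range(width)
    ]
    outputs = []
    for _ in range(batch):
        x = [rng.gauss(0.0, 1.0) for _ in range(width)]  # unit-variance input
        for row in weights:
            outputs.append(sum(w * xi for w, xi in zip(row, x)))
    mean = sum(outputs) / len(outputs)
    return math.sqrt(sum((o - mean) ** 2 for o in outputs) / len(outputs))


if __name__ == "__main__":
    for density in (1.0, 0.25, 0.0625):
        plain = preactivation_std(128, density, rescale_init=False)
        fixed = preactivation_std(128, density, rescale_init=True)
        print(f"density={density}: std without rescaling={plain:.3f}, "
              f"with rescaling={fixed:.3f}")
```

Without rescaling, the output standard deviation shrinks roughly as the square root of the density, which is one simple way to see why signal propagation degrades as sparsity grows. With the assumed rescaling, it stays near one across densities, which is the kind of sparsity-independent behavior the abstract describes for activations.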