With the success of deep neural networks (NNs) in a variety of domains, the computational and storage requirements for training and deploying large NNs have become a bottleneck for further improvements. Sparsification has consequently emerged as a leading approach to tackling these issues. In this work, we consider a simple yet effective approach to sparsification, based on Bridge, or $L_p$, regularization during training. We introduce a novel weight decay scheme that generalizes the standard $L_2$ weight decay to any $p$-norm. We show that this scheme is compatible with adaptive optimizers and avoids the gradient divergence associated with $0<p<1$ norms. We empirically demonstrate that it leads to highly sparse networks while maintaining generalization performance comparable to standard $L_2$ regularization.
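The gradient divergence mentioned for $0<p<1$ can be seen from the penalty itself. The following is a minimal sketch, assuming the regularizer takes the usual Bridge form; the strength $\lambda$, the $1/p$ normalization, and the weight index $i$ are notation introduced here for illustration and are not taken from the abstract:

$$
R_p(w) = \frac{\lambda}{p} \sum_i |w_i|^p,
\qquad
\frac{\partial R_p}{\partial w_i} = \lambda \, |w_i|^{p-1} \, \operatorname{sign}(w_i), \quad w_i \neq 0 .
$$

Under this assumed form, $p = 2$ gives $\partial R_2 / \partial w_i = \lambda w_i$, the familiar $L_2$ weight decay term. For $0<p<1$ the exponent $p-1$ is negative, so $|\partial R_p / \partial w_i| \to \infty$ as $w_i \to 0$, which is exactly the regime a sparsifying penalty pushes weights toward.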