Obtaining versions of deep neural networks that are both highly accurate and highly sparse is one of the main challenges in the area of model compression, and the community has investigated several high-performance pruning techniques. Yet much less is known about the interaction between sparsity and the standard stochastic optimization techniques used for training sparse networks, and most existing work uses standard dense schedules and hyperparameters for training sparse networks. In this work, we examine the impact of high sparsity on model training using the standard computer vision and natural language processing sparsity benchmarks. We begin by showing that using standard dense training recipes for sparse training is suboptimal and results in under-training. We provide new approaches for mitigating this issue for both sparse pre-training of vision models (e.g., ResNet50/ImageNet) and sparse fine-tuning of language models (e.g., BERT/GLUE), achieving state-of-the-art results in both settings in the high-sparsity regime, and providing detailed analyses of the difficulty of sparse training in both scenarios. Our work sets a new threshold for the accuracies that can be achieved under high sparsity, and should inspire further research into improving sparse model training, both to reach higher accuracies under high sparsity and to do so efficiently.