Obtaining versions of deep neural networks that are both highly accurate and highly sparse is one of the main challenges in model compression, and the community has investigated several high-performance pruning techniques. Yet much less is known about the interaction between sparsity and the standard stochastic optimization techniques used for training sparse networks, and most existing work trains sparse networks with standard dense schedules and hyperparameters. In this work, we examine the impact of high sparsity on model training using the standard computer vision and natural language processing sparsity benchmarks. We begin by showing that using standard dense training recipes for sparse training is suboptimal and results in under-training. We provide new approaches for mitigating this issue for both sparse pre-training of vision models (e.g., ResNet50/ImageNet) and sparse fine-tuning of language models (e.g., BERT/GLUE), achieving state-of-the-art results in both settings in the high-sparsity regime, and we provide detailed analyses of the difficulty of sparse training in both scenarios. Our work sets a new threshold for the accuracies that can be achieved under high sparsity, and should inspire further research into improving sparse model training, both to reach higher accuracies under high sparsity and to do so efficiently.