A recent empirical observation (Li et al., 2022b) of activation sparsity in MLP blocks offers an opportunity to drastically reduce computation costs for free. Although existing theoretical explanations attribute activation sparsity to training dynamics, they are restricted to shallow networks, a small number of training steps, and special training procedures, whereas the phenomenon emerges in deep models trained in the standard way for a large number of steps. To fill these gaps, we propose the notion of gradient sparsity as one source of activation sparsity, together with a theoretical explanation based on it that sees sparsity as a necessary step toward adversarial robustness w.r.t. hidden features and parameters, which is approximately the flatness of minima for well-learned models. The theory applies to standardly trained LayerNorm-ed MLPs, and further to Transformers and other architectures trained with weight noise. After eliminating sources of flatness other than sparsity, we discover that the ratio between the largest and smallest non-zero singular values of weight matrices is small. To discuss the emergence of this spectral concentration, we use random matrix theory (RMT) as a powerful tool for analyzing stochastic gradient noise. Validation experiments are conducted to verify our gradient-sparsity-based explanation. We propose two plug-and-play modules for both training and finetuning for sparsity. Experiments on ImageNet-1k and C4 demonstrate sparsity improvements of 50%, indicating further potential cost reduction in both training and inference.
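The two quantities the abstract refers to, activation sparsity in an MLP block and the ratio between the largest and smallest non-zero singular values of a weight matrix, can be measured with a few lines of NumPy. The following is a minimal sketch under stated assumptions, not the paper's implementation: the function names `activation_sparsity` and `nonzero_singular_ratio`, the tolerance parameters, the ReLU activation, and the layer sizes are all hypothetical choices made for illustration. The weights are randomly initialized rather than trained, so the printed numbers say nothing about the reported results.

```python
import numpy as np


def activation_sparsity(hidden, tol=0.0):
    """Fraction of hidden activations whose magnitude is at most `tol`."""
    hidden = np.asarray(hidden, dtype=float)
    return float(np.mean(np.abs(hidden) <= tol))


def nonzero_singular_ratio(weight, rel_tol=1e-10):
    """Ratio of the largest to the smallest non-zero singular value.

    Singular values below `rel_tol` times the largest one are treated
    as zero (an assumption made here for numerical robustness).
    """
    s = np.linalg.svd(np.asarray(weight, dtype=float), compute_uv=False)
    nonzero = s[s > rel_tol * s.max()]
    if nonzero.size == 0:
        raise ValueError("weight matrix has no non-zero singular values")
    return float(nonzero.max() / nonzero.min())


# Toy MLP block with random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
x = rng.standard_normal((256, 64))                  # batch of inputs
w1 = rng.standard_normal((64, 512)) / np.sqrt(64)   # first MLP weight
hidden = np.maximum(x @ w1, 0.0)                    # ReLU activations

sparsity = activation_sparsity(hidden)
ratio = nonzero_singular_ratio(w1)
print(f"activation sparsity: {sparsity:.3f}")
print(f"largest / smallest non-zero singular value: {ratio:.3f}")
```

In a real study these measurements would be taken on the hidden activations and weight matrices of a trained model and tracked over training; the random initialization above only demonstrates how the quantities are computed.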