Theoretical Explanation of Activation Sparsity through Flat Minima and Adversarial Robustness

A recent empirical observation of activation sparsity in MLP layers offers an opportunity to drastically reduce computation costs for free. Although several works attribute it to training dynamics, existing theoretical explanations of how activation sparsity emerges are restricted to shallow networks, a small number of training steps, and modified training procedures, even though the sparsity has been found in deep models trained with vanilla protocols for many steps. To fill these three gaps, we propose gradient sparsity as the source of activation sparsity, and build on it a theoretical explanation in which gradient sparsity, and then activation sparsity, arise as necessary steps toward adversarial robustness w.r.t. hidden features and parameters, which for well-learned models is approximately the flatness of minima. The theory applies to standardly trained LayerNorm-ed pure MLPs, and further to Transformers and other architectures if noise is added to the weights during training. To rule out other sources of flatness when arguing that sparsity is necessary, we identify the phenomenon of spectral concentration, i.e., the ratio between the largest and the smallest non-zero singular values of weight matrices is small. We use random matrix theory (RMT) as a theoretical tool to analyze stochastic gradient noise and to discuss the emergence of spectral concentration. With these insights, we propose two plug-and-play modules for both training from scratch and sparsity finetuning, as well as one radical modification that applies only to from-scratch training. Another module, still under testing and targeting both sparsity and flatness, also follows immediately from our theory. Validation experiments are conducted to verify our explanation. Productivity-oriented experiments show that the proposed modifications improve sparsity, indicating further theoretical cost reductions in both training and inference.
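To make the central quantity concrete, the following is a minimal sketch of how activation sparsity in a single MLP hidden layer could be measured. It assumes a ReLU activation and counts a neuron as inactive when its output is exactly zero (or within a tolerance `tol`). The function names `hidden_activations` and `activation_sparsity` and the toy weights are hypothetical illustrations, not code or definitions from this work.

```python
# Illustrative sketch only: names, weights, and the ReLU choice are assumptions.

def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def hidden_activations(x, weights, biases):
    """Hidden-layer output relu(W x + b).

    `weights` is a list of rows, one row per hidden neuron.
    """
    pre_activations = [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]
    return relu(pre_activations)


def activation_sparsity(activations, tol=0.0):
    """Fraction of hidden activations whose magnitude is at most `tol`."""
    if not activations:
        raise ValueError("activations must be non-empty")
    inactive = sum(1 for a in activations if abs(a) <= tol)
    return inactive / len(activations)


if __name__ == "__main__":
    # Toy layer with 4 hidden neurons and a 2-dimensional input.
    toy_weights = [[1.0, 0.0], [-1.0, 0.0], [0.0, -1.0], [0.0, 1.0]]
    toy_biases = [0.0, 0.0, 0.0, 0.0]
    acts = hidden_activations([1.0, 2.0], toy_weights, toy_biases)
    print("activations:", acts)
    print("sparsity:", activation_sparsity(acts))
```

Under this definition, a higher value means more hidden neurons whose computation can be skipped for the given input, which is the source of the potential cost reduction. In practice the value would be averaged over many inputs and layers.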
