Despite their prevalence in the deep-learning community, over-parameterized models are computationally expensive to train properly. This work studies the fine-grained, module-level learning dynamics of over-parameterized models in order to derive a more efficient and effective training strategy. Empirical evidence reveals that at the level of individual network modules, such as heads in self-attention models, learning patterns vary and are implicitly associated with each module's trainability. To describe these module-level learning capabilities, we introduce the modular neural tangent kernel (mNTK) and demonstrate that the quality of a module's learning is tightly associated with the principal eigenvalue $\lambda_{\max}$ of its mNTK. A large $\lambda_{\max}$ indicates that the module learns features with better convergence, while a small $\lambda_{\max}$ may harm generalization. Inspired by this finding, we propose a training strategy termed Modular Adaptive Training (MAT), which selectively updates only those modules whose $\lambda_{\max}$ exceeds a dynamic threshold, concentrating the model on learning common features and ignoring inconsistent ones. Unlike most existing training schemes, which run a complete backpropagation (BP) cycle across all network modules, MAT saves significant computation through its partial-update strategy and can further improve performance. Experiments show that MAT nearly halves the computational cost of model training and achieves higher accuracy than the baselines.
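The selection step described above can be illustrated with a minimal NumPy sketch. It rests on several assumptions that the abstract does not state: each module's mNTK is taken to be $J J^\top$, where $J$ is the Jacobian of a scalar model output with respect to that module's parameters (one row per sample); the function names `mntk_lambda_max` and `select_modules` are hypothetical; and the threshold rule (a fixed fraction `ratio` of the largest $\lambda_{\max}$, recomputed on every call) is only a placeholder for the dynamic threshold, whose actual form is not given here.

```python
import numpy as np


def mntk_lambda_max(jacobian):
    """Largest eigenvalue of J @ J.T for one module.

    `jacobian` has shape (n_samples, n_module_params). Treating J @ J.T as
    the module's empirical mNTK is an assumption of this sketch.
    """
    kernel = jacobian @ jacobian.T  # symmetric PSD, shape (n_samples, n_samples)
    # eigvalsh returns eigenvalues in ascending order, so the last is the largest.
    return float(np.linalg.eigvalsh(kernel)[-1])


def select_modules(jacobians, ratio=0.5):
    """Return the modules whose lambda_max reaches the threshold.

    `jacobians` maps a module name to its Jacobian matrix. The threshold
    `ratio * max(lambda_max)` is a hypothetical stand-in for the dynamic
    threshold mentioned in the abstract.
    """
    lams = {name: mntk_lambda_max(jac) for name, jac in jacobians.items()}
    threshold = ratio * max(lams.values())
    selected = [name for name, lam in lams.items() if lam >= threshold]
    return selected, lams, threshold
```

In a training loop, one would recompute the selection periodically and apply the optimizer step only to the parameters of the selected modules, leaving the others frozen for that step. How often to recompute is a further design choice not covered by this sketch.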