Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models

Despite their prevalence in deep learning, over-parameterized models are computationally expensive to train properly. This work studies the fine-grained, modular-level learning dynamics of over-parameterized models to arrive at a more efficient and effective training strategy. Empirical evidence reveals that, when zooming in to individual network modules such as the heads of self-attention models, we can observe varying learning patterns that are implicitly associated with each module's trainability. To describe such modular-level learning capabilities, we introduce a novel concept dubbed the modular neural tangent kernel (mNTK), and we demonstrate that the quality of a module's learning is tightly associated with the principal eigenvalue $\lambda_{\max}$ of its mNTK. A large $\lambda_{\max}$ indicates that the module learns features with better convergence, while modules with a small $\lambda_{\max}$ may hurt generalization. Inspired by this discovery, we propose a novel training strategy termed Modular Adaptive Training (MAT), which selectively updates only those modules whose $\lambda_{\max}$ exceeds a dynamic threshold, so that the model concentrates on learning common features and ignores inconsistent ones. Unlike most existing training schemes, which run a complete backpropagation (BP) cycle across all network modules, MAT saves a significant amount of computation through its partial-update strategy and can further improve performance. Experiments show that MAT nearly halves the computational cost of model training and achieves higher accuracy than the baselines.
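The selection rule described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes that a module's mNTK is the empirical kernel $J J^\top$, where $J$ is the Jacobian of the model outputs with respect to that module's parameters only. The abstract does not specify the dynamic threshold, so the mean of the per-module $\lambda_{\max}$ values stands in for it here. The names `mntk_lambda_max`, `select_modules`, `masked_sgd_step`, `head_a` and `head_b` are hypothetical.

```python
import numpy as np


def mntk_lambda_max(jacobian):
    """Principal eigenvalue of J @ J.T for one module (assumed mNTK form).

    jacobian has shape (n_samples, n_module_params). The largest eigenvalue
    of J @ J.T equals the squared largest singular value of J.
    """
    singular_values = np.linalg.svd(jacobian, compute_uv=False)
    return float(singular_values[0] ** 2)


def select_modules(jacobians, threshold=None):
    """Return per-module lambda_max and the modules exceeding the threshold."""
    lambdas = {name: mntk_lambda_max(J) for name, J in jacobians.items()}
    if threshold is None:
        # Assumption: use the mean lambda_max as a stand-in dynamic threshold.
        threshold = float(np.mean(list(lambdas.values())))
    active = [name for name, lam in lambdas.items() if lam > threshold]
    return lambdas, active


def masked_sgd_step(params, grads, active, lr=0.1):
    """Apply an SGD update only to the selected modules; leave others as-is."""
    return {
        name: p - lr * grads[name] if name in active else p
        for name, p in params.items()
    }


# Toy Jacobians for two hypothetical modules (2 samples, 2 parameters each).
jacobians = {
    "head_a": np.array([[2.0, 0.0], [0.0, 1.0]]),
    "head_b": np.array([[0.5, 0.0], [0.0, 0.1]]),
}
params = {"head_a": np.array([1.0, 1.0]), "head_b": np.array([1.0, 1.0])}
grads = {"head_a": np.array([0.5, -0.5]), "head_b": np.array([0.2, 0.2])}

lambdas, active = select_modules(jacobians)
new_params = masked_sgd_step(params, grads, active)
print(lambdas, active)
```

In this toy setting only `head_a` exceeds the threshold, so `head_b` keeps its parameters. The sketch only masks the parameter update. The computational savings claimed in the abstract would additionally require skipping the backward pass for the unselected modules.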

