WAVE: Weight Template for Adaptive Initialization of Variable-sized Models

The growth in model parameters underscores the importance of pre-trained models; however, deployment constraints call for models of variable sizes. As a result, the traditional pre-training and fine-tuning paradigm cannot solve the initialization problem when target models are incompatible with the pre-trained models. We tackle this issue from a multitasking perspective and introduce \textbf{WAVE}, which incorporates a set of shared \textbf{W}eight templates for \textbf{A}daptive initialization of \textbf{V}ariable-siz\textbf{E}d Models. During initialization, each target model initializes weight scalers tailored to its size; these scalers are sufficient to learn, from a limited amount of data, the rules for connecting weight templates via the Kronecker product. To construct the weight templates, WAVE uses the \textit{Learngene} framework, which structurally condenses common knowledge from ancestry models into the weight templates, serving as the learngenes, through knowledge distillation. This process integrates the knowledge of pre-trained models into structured knowledge that follows the rules of the weight templates. We provide a comprehensive benchmark for learngenes, and extensive experiments demonstrate the efficacy of WAVE. The results show that WAVE achieves state-of-the-art performance when initializing models of various depths and widths, and even outperforms directly pre-training $n$ entire models, particularly for smaller models, while saving approximately $n\times$ in computational resources and $5\times$ in storage. WAVE also achieves the most efficient knowledge transfer across a series of datasets, specifically achieving an average improvement of 1.8\% and 1.2\% on 7 downstream datasets.
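
The abstract says that size-specific weight scalers combine shared weight templates through the Kronecker product. The sketch below illustrates that idea only; it is not the authors' implementation. It assumes, purely for illustration, that a weight matrix is formed as $W = \sum_i S_i \otimes T_i$, with scaler $S_i$ as the left operand and template $T_i$ as the right. The function names (`kron`, `build_weight`), the operand order, and the toy values are all hypothetical. Pure Python keeps the example dependency-free; with NumPy, `numpy.kron` computes the same product.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


def mat_add(x, y):
    """Element-wise sum of two equally shaped matrices."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def build_weight(scalers, templates):
    """Assumed composition rule: W = sum_i kron(scalers[i], templates[i]).

    Every template has the same fixed shape; the shape of the scalers
    determines the shape of the resulting weight matrix.
    """
    if len(scalers) != len(templates):
        raise ValueError("need exactly one scaler per template")
    weight = None
    for s, t in zip(scalers, templates):
        term = kron(s, t)
        weight = term if weight is None else mat_add(weight, term)
    return weight


# Two shared 2x2 templates (toy values, not learned).
templates = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]

# A small target model: 1x1 scalers give a 2x2 weight matrix.
small = build_weight([[[2.0]], [[0.5]]], templates)

# A wider target model: 2x3 scalers give a 4x6 weight matrix
# built from the same templates.
wide_scalers = [
    [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]],
    [[0.5, 0.5, 0.0], [1.0, 0.0, 1.0]],
]
wide = build_weight(wide_scalers, templates)
```

In this sketch the templates are shared across all model sizes, and only the small scalers differ per target model. That matches the abstract's statement that the scalers can be learned from limited data.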
