Transformer as Linear Expansion of Learngene

We propose expanding a shared Transformer module to produce and initialize Transformers of varying depths, enabling adaptation to diverse resource constraints. Drawing an analogy to genetic expansibility, we term such a module a learngene. To identify the expansion mechanism, we examine the relationship between a layer's position and its corresponding weight values, and find that a linear function approximates this relationship well. Building on this insight, we present Transformer as Linear Expansion of learnGene (TLEG), a novel approach for flexibly producing and initializing Transformers of diverse depths. Specifically, to learn the learngene, we first construct an auxiliary Transformer linearly expanded from the learngene and then train it with soft distillation. Subsequently, we can produce and initialize Transformers of varying depths by linearly expanding the well-trained learngene, thereby supporting diverse downstream scenarios. Extensive experiments on ImageNet-1K demonstrate that TLEG achieves comparable or better performance than many individual models trained from scratch, while reducing training cost by around 2x. When transferring to several downstream classification datasets, TLEG surpasses existing initialization methods by a large margin (e.g., +6.87% on iNat 2019 and +7.66% on CIFAR-100). When models of varying depths must be produced to fit different resource constraints, TLEG achieves comparable results while reducing the parameters stored to initialize these models by around 19x and pre-training costs by around 5x, compared with the pre-training and fine-tuning approach. When transferring a fixed set of parameters to initialize different models, TLEG offers better flexibility and competitive performance while reducing the parameters stored for initialization by around 2.9x, compared with the pre-training approach.
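As a rough illustration of the idea of linear expansion, the following minimal sketch generates per-layer weights from two shared parameter sets. The specific form `base + (l / depth) * delta`, the names `expand_learngene`, `base`, and `delta`, and the use of flat vectors in place of full Transformer weight tensors are all assumptions made for this example; the abstract only states that weights are approximated by a linear function of layer position.

```python
from typing import List

Vector = List[float]


def expand_learngene(base: Vector, delta: Vector, depth: int) -> List[Vector]:
    """Return `depth` layer weight vectors that vary linearly with layer position.

    Hypothetical parameterization: layer l (0-indexed) gets base + (l / depth) * delta.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if len(base) != len(delta):
        raise ValueError("base and delta must have the same length")

    layers = []
    for l in range(depth):
        # Normalized layer position in [0, 1); an assumed choice of scaling.
        t = l / depth
        layers.append([b + t * d for b, d in zip(base, delta)])
    return layers


# The same two shared vectors initialize models of different depths.
base = [1.0, -2.0]
delta = [4.0, 8.0]
shallow = expand_learngene(base, delta, depth=2)
deep = expand_learngene(base, delta, depth=4)
```

Under this assumed form, only `base` and `delta` need to be stored regardless of the target depth, which is the kind of storage saving the abstract describes when many models of different depths are initialized from one learngene.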
