Training state-of-the-art neural networks is costly in both compute and time. Model scale is recognized as a critical factor in achieving and improving the state of the art. Increasing the scale of a neural network normally requires restarting from scratch with all parameters randomly initialized, because the expansion changes the architecture's parameters in a way that does not allow a straightforward transfer of knowledge from smaller models. In this work, we propose six composable transformations that incrementally increase the size of transformer-based neural networks while preserving functionality, allowing the capacity of the model to be expanded as needed. For each transformation, we provide a proof of exact function preservation under minimal initialization constraints. The proposed methods may enable efficient training pipelines for larger and more powerful models by progressively expanding the architecture throughout training.
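To make the notion of exact function preservation concrete, the following is a minimal sketch and not a reproduction of any of the six transformations as the paper defines them. It widens the hidden layer of a small two-layer ReLU MLP. The assumed initialization constraint is that the new output-side weights start at zero, while the new input-side weights may be arbitrary. All names (`mlp`, `expand_hidden`, the dimensions) are hypothetical and chosen only for illustration.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mlp(x, W1, b1, W2, b2):
    # Two-layer MLP: W2 @ relu(W1 @ x + b1) + b2
    h = relu([a + b for a, b in zip(matvec(W1, x), b1)])
    return [a + b for a, b in zip(matvec(W2, h), b2)]


def expand_hidden(W1, b1, W2, b2, extra, rng):
    """Add `extra` hidden units (hypothetical helper, illustration only).

    New rows of W1 and new entries of b1 are random. New columns of W2
    are zero, so the added units contribute nothing to the output at
    the moment of expansion.
    """
    d_in = len(W1[0])
    new_W1 = [row[:] for row in W1] + [
        [rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(extra)
    ]
    new_b1 = b1[:] + [rng.gauss(0.0, 0.02) for _ in range(extra)]
    new_W2 = [row[:] + [0.0] * extra for row in W2]
    return new_W1, new_b1, new_W2, b2[:]


rng = random.Random(0)
d_in, d_hidden, d_out = 4, 8, 4  # assumed toy sizes
W1 = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_hidden)]
b1 = [rng.gauss(0.0, 1.0) for _ in range(d_hidden)]
W2 = [[rng.gauss(0.0, 1.0) for _ in range(d_hidden)] for _ in range(d_out)]
b2 = [rng.gauss(0.0, 1.0) for _ in range(d_out)]
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]

before = mlp(x, W1, b1, W2, b2)
expanded = expand_hidden(W1, b1, W2, b2, extra=4, rng=rng)
after = mlp(x, *expanded)

# Largest absolute difference between outputs before and after expansion.
max_diff = max(abs(a - b) for a, b in zip(before, after))
print(max_diff)
```

Because the added hidden units are multiplied by zero on the output side, the expanded network computes the same function as the original at the moment of expansion, and subsequent training is free to move the new weights away from zero.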