Modern large-scale neural networks are often trained and released in multiple sizes to accommodate diverse inference budgets. To improve efficiency, recent work has explored model upscaling: initializing a larger model from a trained smaller one in order to transfer knowledge and accelerate convergence. However, upscaling can be sensitive to hyperparameters that must be tuned at the target upscaled model size, which is prohibitively costly to do directly. It remains unclear whether the most common workaround -- tuning on smaller models and extrapolating via hyperparameter scaling laws -- remains sound under upscaling. We address this with principled approaches both to upscaling with respect to model width and to efficiently tuning hyperparameters in this setting. First, motivated by $\mu$P and any-dimensional architectures, we introduce a general upscaling method applicable to a broad range of architectures and optimizers, backed by theory that guarantees models are equivalent to their widened versions and permits rigorous analysis of infinite-width limits. Second, we extend the theory of $\mu$Transfer to a hyperparameter transfer technique for models upscaled with our method, and we empirically demonstrate that this technique is effective on realistic datasets and architectures.
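The abstract does not spell out the upscaling construction itself, so the following is only a rough illustration of what "a model equivalent to its widened version" means in the simplest case. It is a minimal numpy sketch of a classic function-preserving widening (Net2WiderNet-style neuron duplication with outgoing-weight splitting), not the authors' method; the helper name `widen_mlp_layer` and all shapes are assumptions for this example.

```python
import numpy as np

def widen_mlp_layer(W_in, b, W_out, new_width, rng):
    """Function-preserving widening of one hidden layer (Net2WiderNet-style).

    W_in:  (width, d_in)  weights into the hidden layer
    b:     (width,)       hidden biases
    W_out: (d_out, width) weights out of the hidden layer
    """
    width = W_in.shape[0]
    assert new_width >= width
    # Map each new unit to an existing source unit; originals map to themselves.
    idx = np.concatenate([np.arange(width),
                          rng.integers(0, width, new_width - width)])
    # Duplicate incoming weights and biases, so replicas compute identical activations.
    W_in_new = W_in[idx]
    b_new = b[idx]
    # Split each outgoing weight by its source unit's replica count, so the
    # widened layer computes exactly the same function as the original.
    counts = np.bincount(idx, minlength=width)
    W_out_new = W_out[:, idx] / counts[idx]
    return W_in_new, b_new, W_out_new

# Check exact equivalence on a tiny two-layer ReLU network.
rng = np.random.default_rng(0)
W1, b1, W2 = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=(2, 4))
W1w, b1w, W2w = widen_mlp_layer(W1, b1, W2, new_width=8, rng=rng)

x = rng.normal(size=3)
h = np.maximum(W1 @ x + b1, 0)
hw = np.maximum(W1w @ x + b1w, 0)
assert np.allclose(W2 @ h, W2w @ hw)  # widened model matches the original exactly
```

Because replicated units carry identical incoming weights, the equivalence holds for any activation function; the paper's contribution, per the abstract, is a more general construction of this kind that also covers diverse architectures and optimizers and supports infinite-width analysis.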