Despite the dominance and effectiveness of scaling, which has produced large networks with hundreds of billions of parameters, the need to train overparametrized models remains poorly understood, and alternative approaches do not necessarily make it cheaper to train high-performance models. In this paper, we explore low-rank training techniques as an alternative approach to training large neural networks. We introduce a novel method called ReLoRA, which uses low-rank updates to train high-rank networks. We apply ReLoRA to pre-training transformer language models with up to 350M parameters and demonstrate performance comparable to regular neural network training. Furthermore, we observe that the efficiency of ReLoRA increases with model size, making it a promising approach for efficiently training multi-billion-parameter networks. Our findings shed light on the potential of low-rank training techniques and their implications for scaling laws.
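The claim that low-rank updates can train a high-rank network rests on a basic linear-algebra fact: the rank of a sum is at most the sum of the ranks, so several rank-r updates merged into one weight matrix can reach a total rank well above r. The following is a minimal numerical sketch of that fact only, not the authors' training procedure, which the abstract does not specify. The function name `merged_low_rank_updates`, the matrix sizes, and the use of random Gaussian factors in place of trained updates are all assumptions made for illustration.

```python
import numpy as np

def merged_low_rank_updates(dim, rank, num_updates, seed=0):
    """Sum `num_updates` independent rank-`rank` products B @ A.

    Hypothetical helper: random factors stand in for trained
    low-rank updates; this is not the ReLoRA algorithm itself.
    """
    rng = np.random.default_rng(seed)
    total = np.zeros((dim, dim))
    for _ in range(num_updates):
        B = rng.standard_normal((dim, rank))
        A = rng.standard_normal((rank, dim))
        total += B @ A  # each term has rank at most `rank`
    return total

# One rank-4 update versus eight rank-4 updates merged together.
single = merged_low_rank_updates(dim=64, rank=4, num_updates=1)
merged = merged_low_rank_updates(dim=64, rank=4, num_updates=8)

single_rank = np.linalg.matrix_rank(single)
merged_rank = np.linalg.matrix_rank(merged)
print(single_rank, merged_rank)
```

With random factors the merged matrix reaches the upper bound of `rank * num_updates` (capped at `dim`) with probability one; updates obtained by actual training may be correlated and fall below that bound.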