Despite the dominance and effectiveness of scaling, which has produced large networks with hundreds of billions of parameters, the need to train overparameterized models remains poorly understood, while training costs grow exponentially. In this paper, we explore parameter-efficient training techniques as an approach to training large neural networks. We introduce a novel method, ReLoRA, which uses low-rank updates to train high-rank networks. We apply ReLoRA to training transformer language models with up to 1.3B parameters and demonstrate performance comparable to regular neural network training. ReLoRA saves up to 5.5 GB of RAM per GPU and improves training speed by 9-40%, depending on the model size and hardware setup. Our findings show the potential of parameter-efficient techniques for large-scale pre-training.
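The claim that low-rank updates can train a high-rank network rests on a basic linear-algebra fact: the rank of a sum of matrices is bounded by the sum of their ranks, so several rank-r updates can together form an update of rank up to r times their number. The sketch below only illustrates that fact with random matrices in NumPy. It is not the ReLoRA training procedure, which the abstract does not describe, and the function name `accumulate_low_rank_updates` and the chosen sizes are hypothetical.

```python
import numpy as np


def accumulate_low_rank_updates(dim, rank, num_updates, seed=0):
    """Sum `num_updates` random matrices of shape (dim, dim), each of rank <= `rank`.

    Illustrative only: the random factors stand in for learned low-rank updates.
    """
    rng = np.random.default_rng(seed)
    total = np.zeros((dim, dim))
    for _ in range(num_updates):
        # Each update is a product of two thin factors, so its rank is at most `rank`.
        b = rng.standard_normal((dim, rank))
        a = rng.standard_normal((rank, dim))
        total += b @ a
    return total


dim, rank, num_updates = 64, 4, 8

single = accumulate_low_rank_updates(dim, rank, num_updates=1)
summed = accumulate_low_rank_updates(dim, rank, num_updates=num_updates)

rank_single = np.linalg.matrix_rank(single)
rank_summed = np.linalg.matrix_rank(summed)

# A single update cannot exceed `rank`; the sum cannot exceed rank * num_updates.
print("rank of one update:", rank_single)
print("rank of summed updates:", rank_summed)
```

With independent random factors, the summed update typically reaches the upper bound of `rank * num_updates`, although the bound itself is the only guarantee.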