As the size of deep learning models continues to grow, finding optimal models under memory and computation constraints becomes increasingly important. Although the architecture and constituent building blocks of neural networks usually allow them to be used in a modular way, their training process is not aware of this modularity. Consequently, conventional neural network training lacks the flexibility to adapt the computational load of the model during inference. This paper proposes SortedNet, a generalized and scalable solution that harnesses the inherent modularity of deep neural networks across various dimensions for efficient dynamic inference. Our training considers a nested architecture in which the sub-models share parameters, and trains them together with the main model in a sorted and probabilistic manner. This sorted training of sub-networks enables us to scale the number of sub-networks to hundreds using a single round of training. We use a novel updating scheme during training that combines random sampling of sub-networks with gradient accumulation to improve training efficiency. Furthermore, the sorted nature of our training leads to search-free sub-network selection at inference time, and the nested architecture of the resulting sub-networks leads to minimal storage requirements and efficient switching between sub-networks at inference. We demonstrate our general dynamic training approach across various architectures and tasks, including large language models and pre-trained vision models. Experimental results show the efficacy of the proposed approach in achieving efficient sub-networks while outperforming state-of-the-art dynamic training approaches. Our findings demonstrate the feasibility of training up to 160 different sub-models simultaneously, showcasing the extensive scalability of our proposed method while maintaining 96% of the model performance.
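The update scheme described above (nested sub-models with shared parameters, random sub-network sampling, and gradient accumulation) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a plain linear model in which the sub-model of width `k` uses only the first `k` weights, uniform sampling over widths, and a mean-squared-error loss. All names (`train_sorted`, `predict`, `make_data`, `accum`) are hypothetical and chosen only for illustration.

```python
import random

def predict(w, x, width):
    # The sub-model of a given width uses only the first `width` weights,
    # so every smaller sub-model is nested inside every larger one.
    return sum(w[i] * x[i] for i in range(width))

def mse(w, xs, ys, width):
    # Mean squared error of the sub-model of the given width.
    return sum((predict(w, x, width) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train_sorted(xs, ys, dim, steps=300, accum=4, lr=0.05, seed=0):
    # Assumption: uniform sampling over widths 1..dim; other sampling
    # distributions would fit the same loop.
    rng = random.Random(seed)
    w = [0.0] * dim
    n = len(xs)
    for _ in range(steps):
        grad = [0.0] * dim
        for _ in range(accum):
            width = rng.randint(1, dim)  # sample one nested sub-model
            for x, y in zip(xs, ys):
                err = predict(w, x, width) - y
                # Only the weights inside the sampled sub-model get gradient.
                for i in range(width):
                    grad[i] += 2.0 * err * x[i] / n
        # One shared-parameter update from the accumulated gradients.
        for i in range(dim):
            w[i] -= lr * grad[i] / accum
    return w

def make_data(n=64, dim=6, seed=1):
    # Synthetic regression data whose true weights decay with the index,
    # so earlier coordinates matter more than later ones.
    rng = random.Random(seed)
    w_true = [0.5 ** i for i in range(dim)]
    xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    ys = [sum(a * b for a, b in zip(w_true, x)) for x in xs]
    return xs, ys

xs, ys = make_data()
w = train_sorted(xs, ys, dim=6)
# Loss of each nested sub-model, from width 1 up to the full model.
losses = [mse(w, xs, ys, k) for k in range(1, 7)]
```

Because every sub-model is a prefix of the same weight vector, a single copy of the parameters is stored, and switching sub-models at inference amounts to changing `width`, with no search step required.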