Modern deep learning models, growing ever larger and more complex, have demonstrated exceptional generalization and accuracy as a result of training on huge datasets, and this trend is expected to continue. However, the increasing size of these models makes training challenging, because traditional centralized methods are limited by memory constraints at such scales. This paper proposes an asynchronous decentralized training paradigm for large modern deep learning models that harnesses the compute power of regular heterogeneous PCs with limited resources, connected across the internet, to achieve favourable performance metrics. The proposed approach, Ravnest, facilitates decentralized training by efficiently organizing compute nodes into clusters with similar data transfer rates and compute capabilities, without requiring each node to host the entire model. These clusters engage in $\textit{Zero-Bubble Asynchronous Model Parallel}$ training, and a $\textit{Parallel Multi-Ring All-Reduce}$ method is employed to execute global parameter averaging across all clusters. We frame our asynchronous SGD loss function as a block structured optimization problem with delayed updates and derive an optimal convergence rate of $O\left(\frac{1}{\sqrt{K}}\right)$. We further discuss linear speedup with respect to the number of participating clusters and the bound on the staleness parameter.
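To make the parameter-averaging step more concrete, the sketch below simulates a plain single-ring all-reduce average in one Python process. It is an illustration of the generic ring all-reduce pattern (a scatter-reduce phase followed by an all-gather phase), not the Parallel Multi-Ring All-Reduce method itself, which is not specified in this abstract. The function names `split_chunks` and `ring_all_reduce_average` are hypothetical, and the simulation assumes synchronous steps with no network, staleness, or node failures.

```python
from typing import List


def split_chunks(vec: List[float], n: int) -> List[List[float]]:
    """Split vec into n contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(vec), n)
    chunks, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        chunks.append(list(vec[start:start + size]))
        start += size
    return chunks


def ring_all_reduce_average(node_params: List[List[float]]) -> List[List[float]]:
    """Simulate a single-ring all-reduce that averages parameters across nodes.

    node_params[i] is the flat parameter vector held by (hypothetical) node i.
    Returns the vector each node holds afterwards; all entries are the average.
    """
    n = len(node_params)
    if n == 0:
        raise ValueError("need at least one node")
    length = len(node_params[0])
    if any(len(p) != length for p in node_params):
        raise ValueError("all nodes must hold vectors of the same length")
    if n == 1:
        return [list(node_params[0])]

    # Each node splits its vector into n chunks; chunk k is reduced around the ring.
    chunks = [split_chunks(p, n) for p in node_params]

    # Phase 1 (scatter-reduce): in step s, node i sends chunk (i - s) mod n to its
    # right neighbour, which adds it to its own copy. Messages are snapshotted
    # first so that all sends within a step behave as if simultaneous.
    for step in range(n - 1):
        msgs = [(i, (i - step) % n, list(chunks[i][(i - step) % n])) for i in range(n)]
        for sender, idx, payload in msgs:
            receiver = (sender + 1) % n
            chunks[receiver][idx] = [a + b for a, b in zip(chunks[receiver][idx], payload)]

    # After phase 1, node i holds the fully summed chunk (i + 1) mod n.
    # Phase 2 (all-gather): circulate the finished chunks until every node has all.
    for step in range(n - 1):
        msgs = [(i, (i + 1 - step) % n, list(chunks[i][(i + 1 - step) % n])) for i in range(n)]
        for sender, idx, payload in msgs:
            receiver = (sender + 1) % n
            chunks[receiver][idx] = payload

    # Divide the sums by the number of nodes to obtain the average.
    return [[x / n for chunk in node_chunks for x in chunk] for node_chunks in chunks]
```

For example, `ring_all_reduce_average([[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0], [5.0, 6.0, 7.0, 8.0]])` leaves every simulated node holding `[3.0, 4.0, 5.0, 6.0]`. In this pattern each node exchanges only one chunk per step with its ring neighbour, which is why ring-based schemes are commonly used when per-link bandwidth is the bottleneck.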