Scaling neural network models has delivered dramatic quality gains across ML problems. However, this scaling has increased the reliance on efficient distributed training techniques. As with other distributed computing scenarios, it is therefore important to understand how compute and communication will scale relative to one another as models scale and hardware evolves. A careful study that answers this question can better guide the design of future systems that can efficiently train future large models. Accordingly, this work provides a comprehensive multi-axial (algorithmic, empirical, hardware evolution) analysis of compute vs. communication (Comp-vs.-Comm) scaling for future Transformer models on future hardware. First, our algorithmic analysis shows that compute generally enjoys an edge over communication as models scale. However, since memory capacity scales more slowly than compute, these trends are being stressed. Next, we quantify this edge by empirically studying how Comp-vs.-Comm scales for future models on future hardware. To avoid profiling numerous Transformer models across many setups, we extract execution regions and project costs using operator models. This allows a spectrum (hundreds) of future model/hardware scenarios to be accurately studied ($<$15% error) and reduces profiling costs by 2100$\times$. Our experiments show that communication will be a significant portion (40-75%) of runtime as models and hardware evolve. Moreover, communication that is hidden by overlapped computation in today's models often cannot be hidden in future, larger models. Overall, this work highlights the increasingly large role communication will play as models scale and discusses techniques and upcoming technologies that can help address it.
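The algorithmic claim above can be illustrated with a back-of-the-envelope sketch. This is not the paper's operator model. It assumes a Megatron-style tensor-parallel Transformer MLP block (two GEMMs, hidden size H to 4H and back), a ring all-reduce of the output activation, and no compute/communication overlap. All function names, the `bytes_per_elem` default, and the hardware rates in the example are hypothetical.

```python
def mlp_flops_per_device(batch, seq, hidden, tp):
    """Forward FLOPs of the two MLP GEMMs (H -> 4H and 4H -> H), split over tp devices.

    Assumption: each GEMM costs 2 * tokens * H * 4H FLOPs (multiply + add).
    """
    tokens = batch * seq
    return 2 * (2 * tokens * hidden * 4 * hidden) / tp


def allreduce_bytes_per_device(batch, seq, hidden, tp, bytes_per_elem=2):
    """Bytes each device sends in a ring all-reduce of a [tokens, hidden] tensor.

    Assumption: ring all-reduce moves 2 * (tp - 1) / tp of the tensor per device.
    """
    tokens = batch * seq
    return 2 * (tp - 1) / tp * tokens * hidden * bytes_per_elem


def comp_to_comm_time_ratio(batch, seq, hidden, tp, flops_per_s, bytes_per_s):
    """Ratio of compute time to communication time for one MLP block."""
    if tp < 2:
        raise ValueError("tp must be >= 2; a single device performs no all-reduce")
    compute_s = mlp_flops_per_device(batch, seq, hidden, tp) / flops_per_s
    comm_s = allreduce_bytes_per_device(batch, seq, hidden, tp) / bytes_per_s
    return compute_s / comm_s


if __name__ == "__main__":
    # Hypothetical device: 100 TFLOP/s compute, 100 GB/s interconnect.
    for hidden in (4096, 8192, 16384):
        for tp in (2, 8):
            r = comp_to_comm_time_ratio(8, 2048, hidden, tp, 100e12, 100e9)
            print(f"hidden={hidden:6d} tp={tp}: compute/comm time ratio = {r:.2f}")
```

Under these assumptions compute grows as H^2 per token while communicated bytes grow as H, so the ratio rises linearly with hidden size. Raising the tensor-parallel degree, which a model may need once it no longer fits in device memory, lowers the ratio again. This is one simplified way to read the abstract's remark that slower memory-capacity scaling stresses the trend.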