Scaling DNNs has been shown to deliver dramatic quality gains across ML problems. This scaling, however, has also led to a concomitant quadratic increase in computation cost. To cope with this cost, and with accelerator memory capacity failing to keep pace, training these models increasingly relies on distributed training techniques. An important question therefore arises: how will compute and communication scale relative to each other as models scale and hardware evolves? A careful study that answers this question can better guide the design of future systems. To this end, this work provides a comprehensive multi-axial (algorithmic, empirical, hardware evolution) analysis of compute vs. communication (Comp-vs.-Comm) scaling for future Transformer models on future hardware. Using algorithmic analysis, we show that compute generally enjoys an edge over communication as models scale. However, when viewed through the lens of slower memory capacity scaling, this edge comes under stress. Next, we craft an empirical strategy to study Comp-vs.-Comm scaling for future models and hardware using existing hardware. This allows hundreds of future model/hardware scenarios to be studied at profiling costs three orders of magnitude lower. Our experiments demonstrate that communication will be a significant portion (about 40-75%) of execution time as models and hardware evolve, and that communication which is hidden today by overlapping it with computation will likely become exposed. Further, the generality of our strategy makes it a strong basis for Comp-vs.-Comm scaling analysis of any future model. Overall, this work underscores the increasingly large role communication will play as models scale.
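The algorithmic claim that compute tends to outgrow communication can be illustrated with a back-of-the-envelope sketch. It is not the analysis used in this work: it assumes a standard Transformer layer with a 4x MLP expansion, Megatron-style tensor parallelism with two activation all-reduces per layer in the forward pass, and 2-byte elements. The function names and the FLOP-counting convention (2 FLOPs per multiply-accumulate) are illustrative choices.

```python
def layer_flops(batch, seq_len, hidden):
    """Approximate forward FLOPs of one Transformer layer (assumed shapes).

    QKV projection, attention output projection, and a 4x MLP contribute
    24 * B * S * H^2; attention scores and context contribute 4 * B * S^2 * H.
    """
    gemm_flops = 24 * batch * seq_len * hidden ** 2
    attention_flops = 4 * batch * seq_len ** 2 * hidden
    return gemm_flops + attention_flops


def layer_allreduce_bytes(batch, seq_len, hidden, bytes_per_elem=2):
    """Bytes all-reduced per layer in the forward pass.

    Assumption: two all-reduces of a B x S x H activation tensor per layer,
    as in Megatron-style tensor parallelism.
    """
    return 2 * batch * seq_len * hidden * bytes_per_elem


def comp_to_comm_ratio(batch, seq_len, hidden):
    """FLOPs performed per byte communicated for one layer."""
    return layer_flops(batch, seq_len, hidden) / layer_allreduce_bytes(
        batch, seq_len, hidden
    )


if __name__ == "__main__":
    # Doubling the hidden size raises the FLOPs-per-byte ratio.
    for hidden in (1024, 2048, 4096):
        print(hidden, comp_to_comm_ratio(batch=1, seq_len=2048, hidden=hidden))
```

Under these assumptions the ratio grows linearly with hidden size and sequence length and is independent of batch size, which is the sense in which compute enjoys an edge. The sketch deliberately ignores memory capacity limits, which can force a higher degree of parallelism and shift the balance back toward communication.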