With the rapid growth of large language models (LLMs), a wide range of methods have been developed to distribute computation and memory across hardware devices for efficient training and inference. While existing surveys provide descriptive overviews of these techniques, systematic analysis of their benefits and trade-offs, and of how such insights can inform a principled methodology for designing optimal distributed systems, remains limited. This paper offers a comprehensive review of collective operations and distributed parallelism strategies, complemented by mathematical formulations that deepen theoretical understanding. We further examine hybrid parallelization designs, emphasizing communication-computation overlap across different stages of model deployment, covering both training and inference. Recent advances in automated, cost-model-driven search for optimal hybrid parallelization strategies are also discussed. Moreover, we present case studies of mainstream architecture categories that reveal empirical insights to guide researchers and practitioners in selecting parallelism strategies. Finally, we highlight open challenges and limitations of current LLM training paradigms and outline promising directions for the next generation of large-scale model development.