Large Language Models (LLMs) such as GPT and LLaMA are revolutionizing the AI industry with their sophisticated capabilities. Training these models requires vast GPU clusters and significant computing time, posing major challenges in scalability, efficiency, and reliability. This survey explores recent advancements in training systems for LLMs, including innovations in training infrastructure spanning AI accelerators, networking, storage, and scheduling. It also covers parallelism strategies, as well as optimizations for computation, communication, and memory in distributed LLM training, and approaches for maintaining system reliability over extended training periods. By examining current innovations and future directions, this survey aims to provide valuable insights for improving LLM training systems and tackling ongoing challenges. Furthermore, traditional digital circuit-based computing systems face significant constraints in meeting the computational demands of LLMs, highlighting the need for innovative solutions such as optical computing and optical networking.