In large-scale distributed LLM training, communication between devices becomes the key performance bottleneck. Chiplet technology integrates multiple dies into a single package, scaling up node performance with higher bandwidth. Meanwhile, optical interconnect (OI) technology offers long-reach, high-bandwidth links, making it well suited to scale-out networks. Combined, the two technologies have the potential to overcome communication bottlenecks both within and across packages. In this work, we present ChipLight, a cross-layer, multi-objective design and optimization method for training clusters that leverage chiplets and OI. We first abstract an architecture model for such complex clusters and co-optimize the chiplet architecture, the training parallelism strategy, and the OI network topology. Based on this model, we tailor the design space exploration flow by combining black-box and white-box methodologies. Our experimental results show that ChipLight significantly improves training efficiency and provides valuable design insights for the development of future training clusters.