Hecaton: Training Large Language Models with Scalable Chiplet Systems

Large language models (LLMs) have achieved remarkable success in various fields, but training and fine-tuning them requires massive computation and memory. This necessitates parallelism, which in turn introduces heavy communication overheads. Driven by advances in packaging, the chiplet architecture has emerged as a potential solution: it can integrate computing power and utilize on-package links with better signal integrity, higher bandwidth, and lower energy consumption. However, most existing chiplet-related works focus on DNN inference. Porting them directly to LLM training incurs large amounts of DRAM access and network-on-package (NoP) overhead, which cause state-of-the-art chiplet designs to fail and highlight a research gap. This work proposes Hecaton, a scalable and cost-effective chiplet system for LLM training. We first provide a chiplet architecture with tailored scheduling that largely reduces DRAM accesses. We further design an efficient distributed training method that reduces NoP communication complexity and relieves constraints on SRAM capacity and layout. Theoretical analysis shows that the entire system achieves weak scaling: as the workload and hardware resources grow proportionally, the computation-to-communication ratio remains nearly constant. Experiments with various workloads and hardware configurations verify this property, and Hecaton achieves a $5.29\times$ performance improvement and a $3.46\times$ energy reduction on Llama3.1-405B compared to the tensor parallelism in Megatron. To the best of our knowledge, this is the first chiplet architecture designed specifically for LLM training or fine-tuning, with guaranteed performance regardless of the problem scale.
