Large Language Models (LLMs) have achieved remarkable success in various fields, but their training and finetuning require massive computation and memory. This necessitates parallelism, which in turn introduces heavy communication overheads. Driven by advances in packaging, the chiplet architecture has emerged as a potential solution: it integrates computing power and uses on-package links that offer better signal integrity, higher bandwidth, and lower energy consumption. However, most existing chiplet work focuses on DNN inference. Porting these designs directly to LLM training incurs large amounts of DRAM access and network-on-package (NoP) overhead, which cause state-of-the-art chiplet designs to fail and highlight a research gap. This work proposes Hecaton, a scalable and cost-effective chiplet system for LLM training and finetuning. We first present a chiplet architecture with tailored scheduling that greatly reduces DRAM accesses. We then design an efficient distributed training method that reduces NoP communication complexity and relieves constraints on SRAM capacity and layout. Theoretical analysis shows that the entire system achieves weak scaling: as the workload and hardware resources grow proportionally, the computation-to-communication ratio remains nearly constant. Experiments with various workloads and hardware configurations verify this property. Compared to the tensor parallelism in Megatron, Hecaton achieves a $4.98\times$ performance improvement and a $2.35\times$ energy reduction on Llama2-70B. To the best of our knowledge, this is the first chiplet architecture designed specifically for LLM training or finetuning, with guaranteed performance regardless of the problem scale.