Fully Sharded Data Parallel (FSDP), also known as the Zero Redundancy Optimizer (ZeRO), is widely used for large-scale model training because of its memory efficiency and minimal intrusion on model code. However, existing FSDP systems rely on fixed element-wise or row-wise sharding formats that conflict with block-structured computations. As a result, they struggle to support modern structure-aware training methods, including block-wise quantization and non-element-wise optimizers such as Shampoo and Muon. In addition, current implementations incur communication and memory overheads that degrade efficiency at the scale of tens of thousands of GPUs. We introduce veScale-FSDP, an FSDP system that combines RaggedShard, a flexible sharding format, with a structure-aware planning algorithm to deliver both flexibility and performance. veScale-FSDP enables zero-copy FSDP communication and natively supports block-wise quantization and non-element-wise optimizers, achieving 5% to 66% higher throughput and 16% to 30% lower memory usage than existing FSDP systems while scaling efficiently to tens of thousands of GPUs.
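To make the conflict between fixed sharding formats and block-structured computation concrete, the following minimal sketch compares an even element-wise split of a flattened parameter with a split whose boundaries fall only on block boundaries. It is an illustration under stated assumptions, not the RaggedShard format or the planning algorithm of veScale-FSDP: the function names, the greedy block assignment, and the sizes (1000 elements, 3 ranks, blocks of 128 elements) are all hypothetical.

```python
def even_shard_bounds(numel, world_size):
    """Element-wise sharding: cut a flat buffer into equal-sized chunks."""
    chunk = -(-numel // world_size)  # ceiling division
    return [(min(r * chunk, numel), min((r + 1) * chunk, numel))
            for r in range(world_size)]


def block_aligned_shard_bounds(numel, world_size, block):
    """Hypothetical ragged sharding: shard boundaries fall only on block
    boundaries, so shards may differ in size across ranks."""
    num_blocks = -(-numel // block)
    base, extra = divmod(num_blocks, world_size)
    bounds, start_block = [], 0
    for r in range(world_size):
        n = base + (1 if r < extra else 0)  # first `extra` ranks get one more block
        start = min(start_block * block, numel)
        end = min((start_block + n) * block, numel)
        bounds.append((start, end))
        start_block += n
    return bounds


def count_split_blocks(bounds, numel, block):
    """Count blocks whose elements are spread over more than one rank."""
    cuts = {end for _, end in bounds if 0 < end < numel}
    return len({c // block for c in cuts if c % block != 0})


# Illustrative sizes only (assumptions, not taken from the paper).
numel, world_size, block = 1000, 3, 128
even = even_shard_bounds(numel, world_size)
ragged = block_aligned_shard_bounds(numel, world_size, block)
print(even, count_split_blocks(even, numel, block))      # 2 blocks are split
print(ragged, count_split_blocks(ragged, numel, block))  # 0 blocks are split
```

With the even split, two 128-element blocks straddle rank boundaries, so a block-wise quantizer or a block-structured optimizer step would need extra communication or copies to see a whole block. With the block-aligned split, every block is local to one rank, at the cost of unequal shard sizes.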