The scaling of Large Language Models (LLMs) drives interest in matrix-based optimizers (e.g., Shampoo, Muon, SOAP) for their convergence efficiency; yet their requirement for holistic updates conflicts with the tensor fragmentation imposed by distributed frameworks such as Megatron. Existing solutions are suboptimal: synchronous approaches incur computational redundancy, while layer-wise partitioning cannot reconcile this conflict without violating the geometric constraints of efficient communication primitives. To bridge this gap, we propose Canzona, a unified, asynchronous, and load-balanced framework that decouples logical optimizer assignment from physical parameter distribution. For data parallelism, we introduce an α-Balanced Static Partitioning strategy that respects update atomicity while neutralizing load imbalance. For tensor parallelism, we design an asynchronous compute pipeline that employs Micro-Group Scheduling to batch fragmented updates and hide reconstruction overhead. Extensive evaluations on the Qwen3 model family (up to 32B parameters) across 256 GPUs demonstrate that our approach preserves the efficiency of established parallel architectures, achieving a 1.57x speedup in end-to-end iteration time and a 5.8x reduction in optimizer step latency compared to the baseline.