MegaFold: Efficient Training of Next-Generation 3D Attention Protein Models on Cross-Platform GPUs

Recent advances in biomolecular modeling have been catalyzed by models such as AlphaFold3 (AF3), which introduce science-informed changes to the transformer architecture. Unlike standard transformers, AF3-style models are defined by 3D attention over 2D pairwise representations, which produces tensors whose computation and memory costs scale cubically with sequence length. As a result, despite moderate parameter counts, AF3-style models are far more expensive to train than size-equivalent transformers and are severely constrained by GPU memory capacity. Our characterization shows that 3D attention fundamentally changes the training workload: it produces massive 3D attention maps, complex inter-operator dependencies, kernel fragmentation, and heavy host-side data pipelines. These characteristics differ substantially from LLM training and lead to poor utilization on modern GPU systems. Moreover, existing GPU optimizations do not adequately address these challenges because of the complex cross-layer inter-operator dependencies that 3D attention introduces. Motivated by these challenges, we introduce MegaFold, a novel cross-platform system for efficient training of next-generation 3D-attention protein models. MegaFold combines a memory-efficient 3D-attention kernel, a communication-efficient sharding strategy for quadratic representations, fused operator implementations for critical execution paths, and a determinism-aware host-device pipeline that eliminates preprocessing stalls. Evaluation on both NVIDIA H200 and AMD MI250 GPUs shows that MegaFold enables training on sequences up to 3.36$\times$ longer on 32 GPUs while reducing end-to-end execution time by up to 1.73$\times$ (NVIDIA) and 1.62$\times$ (AMD).
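
To make the cubic scaling concrete, the following is a minimal back-of-the-envelope sketch in Python, not MegaFold's kernel or any measurement from the evaluation. It assumes that a naive implementation materializes one attention logit tensor of shape (heads, N, N, N) for a pair representation of shape (N, N, channels). The function names and the default values (4 heads, 128 channels, 2 bytes per element) are illustrative assumptions, not figures taken from this work.

```python
def attention_logit_bytes(seq_len, num_heads=4, bytes_per_elem=2):
    """Bytes needed to materialize one (heads, N, N, N) attention logit tensor.

    Assumption: each of the N rows of the pair representation attends over
    N x N positions, per head, and the result is stored in full.
    """
    return num_heads * seq_len ** 3 * bytes_per_elem


def pair_repr_bytes(seq_len, channels=128, bytes_per_elem=2):
    """Bytes for the 2D pair representation of shape (N, N, channels)."""
    return seq_len ** 2 * channels * bytes_per_elem


if __name__ == "__main__":
    mib = 1024 ** 2
    for n in (256, 512, 1024):
        logits = attention_logit_bytes(n) / mib
        pair = pair_repr_bytes(n) / mib
        print(f"N={n:5d}  pair repr: {pair:8.1f} MiB  3D logits: {logits:8.1f} MiB")
```

Under these assumptions, doubling the sequence length multiplies the logit tensor by 8 but the pair representation by only 4. At N = 1024 with 4 heads and 2-byte elements, a single materialized logit tensor already occupies 8 GiB, which illustrates why memory capacity rather than parameter count becomes the limiting factor.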
