Understanding cellular machinery requires atomic-scale reconstruction of large biomolecular assemblies. However, predicting the structures of these systems has been constrained by the hardware memory requirements of models like AlphaFold 3, which impose a practical ceiling of a few thousand residues on what a single GPU can process. Here we present NVIDIA BioNeMo Fold-CP, a context parallelism framework that overcomes this barrier by distributing the inference and training pipelines of co-folding models across multiple GPUs. We use the Boltz models as open-source reference architectures and implement custom multidimensional primitives that efficiently parallelize both the dense triangular updates and the irregular, data-dependent pattern of window-batched local attention. Our approach achieves efficient memory scaling: for an N-token input distributed across P GPUs, per-device memory scales as $O(N^2/P)$, enabling structure prediction for assemblies exceeding 30,000 residues on 64 NVIDIA B300 GPUs. We demonstrate the scientific utility of this approach through successful developer use cases: Fold-CP enabled the scoring of over 90% of the Comprehensive Resource of Mammalian Protein Complexes (CORUM) database, as well as the folding, without cropping, of the disease-relevant PI4KA lipid kinase complex bound to an intrinsically disordered region. By providing a scalable pathway for modeling massive systems with full global context, Fold-CP represents a significant step toward the realization of a virtual cell.
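To make the $O(N^2/P)$ scaling concrete, the following is a minimal back-of-the-envelope sketch, not the Fold-CP implementation. It assumes the dominant activation is a single N x N pair tensor split evenly over a square device grid, with a channel width of 128 and 2-byte (bf16) values; the function names, the square-grid layout, and both defaults are illustrative assumptions rather than details taken from this work.

```python
import math


def pair_tile_shape(n_tokens, grid_rows, grid_cols):
    """Largest tile of an N x N pair tensor on a grid_rows x grid_cols device grid.

    Hypothetical helper: rows and columns are split as evenly as possible,
    so the largest tile uses ceiling division along each axis.
    """
    return math.ceil(n_tokens / grid_rows), math.ceil(n_tokens / grid_cols)


def pair_bytes_per_device(n_tokens, n_devices, channels=128, bytes_per_value=2):
    """Bytes held per device for one pair tensor under 2D sharding.

    Assumptions (not from the source): a square device grid, a pair
    channel width of 128, and 2-byte values.
    """
    grid = math.isqrt(n_devices)
    if grid * grid != n_devices:
        raise ValueError("this sketch assumes a square device grid")
    rows, cols = pair_tile_shape(n_tokens, grid, grid)
    return rows * cols * channels * bytes_per_value


if __name__ == "__main__":
    # Per-device footprint of one pair tensor for a 30,000-token input.
    for p in (1, 4, 16, 64):
        gib = pair_bytes_per_device(30_000, p) / 2**30
        print(f"P={p:3d}: {gib:8.2f} GiB per device")
```

Under these assumptions the per-device footprint of the pair tensor falls in proportion to 1/P, which is the behavior the scaling claim describes. A real run also holds weights, other activations, and communication buffers, so these figures are a lower bound on one tensor, not a total memory estimate.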