Emulating computationally intensive scientific simulations is crucial for enabling uncertainty quantification, optimization, and informed decision-making at scale. Gaussian Processes (GPs) offer a flexible and data-efficient foundation for statistical emulation, but their poor scalability limits applicability to large datasets. We introduce the Scaled Block Vecchia (SBV) algorithm for distributed GPU-based systems. SBV integrates the Scaled Vecchia approach for anisotropic input scaling with the Block Vecchia (BV) method to reduce computational and memory complexity while leveraging GPU acceleration techniques for efficient linear algebra operations. To the best of our knowledge, this is the first distributed implementation of any Vecchia-based GP variant. Our implementation employs MPI for inter-node parallelism and the MAGMA library for GPU-accelerated batched matrix computations. We demonstrate the scalability and efficiency of the proposed algorithm through experiments on synthetic and real-world workloads, including a 50M point simulation from a respiratory disease model. SBV achieves near-linear scalability on up to 512 A100 and GH200 GPUs, handles 2.56B points, and reduces energy use relative to exact GP solvers, establishing SBV as a scalable and energy-efficient framework for emulating large-scale scientific models on GPU-based distributed systems.
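The combination of anisotropic input scaling and block-wise conditioning described above can be illustrated with a minimal single-process NumPy sketch. This is not the authors' implementation: it uses no MPI, MAGMA, or GPU batching, and every function name and parameter here (`sq_exp_kernel`, `gaussian_logpdf`, `block_vecchia_loglik`, `lengthscales`, `block_size`, `m`, `nugget`) is hypothetical. It also makes simplifying assumptions: a squared-exponential kernel, blocks formed from consecutive points in the given ordering, and neighbors chosen as the `m` earlier points nearest to each block centroid in the scaled input space. The clustering and neighbor selection in the actual method may differ.

```python
import numpy as np

def sq_exp_kernel(a, b, variance=1.0):
    # Squared-exponential kernel on inputs that are ALREADY scaled,
    # so the length-scale is 1 in every dimension.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2)

def gaussian_logpdf(y, mean, cov):
    # Log-density of N(mean, cov) evaluated at y, via a Cholesky factor.
    L = np.linalg.cholesky(cov)
    z = np.linalg.solve(L, y - mean)
    return (-0.5 * z @ z
            - np.log(np.diag(L)).sum()
            - 0.5 * len(y) * np.log(2.0 * np.pi))

def block_vecchia_loglik(x, y, lengthscales, block_size=4, m=8, nugget=1e-6):
    # Anisotropic scaling: divide each input dimension by its length-scale
    # so that neighbor distances reflect how relevant each dimension is.
    xs = x / lengthscales
    n = len(y)
    total = 0.0
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        xb, yb = xs[start:stop], y[start:stop]
        Kbb = sq_exp_kernel(xb, xb) + nugget * np.eye(stop - start)
        if start == 0:
            # The first block has no predecessors: use its marginal density.
            mean, cov = np.zeros(stop - start), Kbb
        else:
            # Assumption: condition on the m earlier points closest
            # to the block centroid in the scaled space.
            centroid = xb.mean(axis=0)
            d2 = ((xs[:start] - centroid) ** 2).sum(axis=1)
            nbr = np.argsort(d2)[:m]
            xn, yn = xs[nbr], y[nbr]
            Knn = sq_exp_kernel(xn, xn) + nugget * np.eye(len(nbr))
            Kbn = sq_exp_kernel(xb, xn)
            A = np.linalg.solve(Knn, Kbn.T).T   # Kbn @ inv(Knn)
            mean = A @ yn                       # conditional mean
            cov = Kbb - A @ Kbn.T               # conditional covariance
        total += gaussian_logpdf(yb, mean, cov)
    return total
```

Each block requires only small dense factorizations (at most `m` by `m` and `block_size` by `block_size`), which is the kind of workload that batched GPU linear algebra is designed for. When `m` is at least the number of preceding points, every block conditions on all earlier data and the sum reduces to the exact GP log-likelihood by the chain rule, which gives a convenient correctness check for the sketch.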