In this work, we present a distributed implementation of the Primal-Dual Hybrid Gradient (PDHG) algorithm for solving massive-scale linear programming (LP) problems. Although PDHG-based solvers have shown strong performance on single-node GPU architectures, their applicability to industrial-scale instances is often limited by GPU memory capacity and computational throughput. To overcome these challenges, we extend the PDHG framework to a distributed-memory setting via a practical two-dimensional grid partitioning of the constraint matrix, enabling scalable execution across multiple GPUs. Our implementation leverages the NCCL communication backend to efficiently synchronize primal-dual updates across devices. To improve load balance and computational efficiency, we introduce a block-wise random shuffling strategy combined with nonzero-aware data distribution, and further accelerate computation through fused CUDA kernels. By distributing both memory and computation, the proposed framework not only overcomes the single-GPU memory bottleneck but also achieves substantial speedups by exploiting multi-GPU parallelism with relatively low communication overhead. Extensive experiments on standard LP benchmarks, including MIPLIB and Hans' instances, as well as large-scale real-world datasets, show that our distributed implementation, built upon cuPDLPx, achieves strong scalability and high performance while preserving full FP64 numerical accuracy.
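To make the 2D partitioning and communication pattern described above concrete, the following is a minimal sketch of one distributed PDHG iteration for an equality-constrained LP (min c^T x s.t. Ax = b, x >= 0) with the constraint matrix split over a 2D process grid. It is an illustration only, not the cuPDLPx implementation or its API: it uses PyTorch's `torch.distributed` with the NCCL backend as a stand-in for raw NCCL calls, dense toy blocks in place of sparse nonzero-aware storage, fixed step sizes, and no block shuffling or fused kernels; all names (`pdhg_step`, `GRID_ROWS`, block sizes) are hypothetical.

```python
# Sketch of one distributed PDHG iteration on a 2D process grid.
# Assumptions: torch.distributed with the NCCL backend stands in for raw NCCL,
# blocks are dense toy data, step sizes are fixed. Not the cuPDLPx interface.
import os
import torch
import torch.distributed as dist


def pdhg_step(A_blk, c_blk, b_blk, x, y, tau, sigma, row_group, col_group):
    """One PDHG iteration for  min c^T x  s.t.  A x = b, x >= 0.
    This rank holds the block A_blk = A[i, j], the primal slice x_j,
    and the dual slice y_i."""
    # A^T y: local partial product, summed over the grid column
    # (every rank holding column block j contributes A[i, j]^T y_i).
    aty = A_blk.t() @ y
    dist.all_reduce(aty, op=dist.ReduceOp.SUM, group=col_group)

    # Primal update, projected onto the nonnegative orthant.
    x_new = torch.clamp(x - tau * (c_blk - aty), min=0.0)

    # A (2 x_new - x): local partial product, summed over the grid row.
    ax = A_blk @ (2.0 * x_new - x)
    dist.all_reduce(ax, op=dist.ReduceOp.SUM, group=row_group)

    # Dual update for the equality constraints A x = b.
    y_new = y + sigma * (ax - b_blk)
    return x_new, y_new


def main():
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    grid_rows = int(os.environ.get("GRID_ROWS", "2"))
    assert world % grid_rows == 0, "world size must be divisible by GRID_ROWS"
    grid_cols = world // grid_rows
    i, j = rank // grid_cols, rank % grid_cols
    torch.cuda.set_device(rank % torch.cuda.device_count())
    dev = torch.device("cuda")

    # One communicator per grid row and per grid column; every rank must
    # call new_group for every group, in the same order.
    row_groups = [dist.new_group([r * grid_cols + c for c in range(grid_cols)])
                  for r in range(grid_rows)]
    col_groups = [dist.new_group([r * grid_cols + c for r in range(grid_rows)])
                  for c in range(grid_cols)]

    # Toy FP64 data, seeded so replicated slices agree across the grid
    # (c_j / x_j shared along a grid column, b_i / y_i shared along a grid row).
    m_loc, n_loc = 1024, 1024
    torch.manual_seed(j)
    c_blk = torch.randn(n_loc, dtype=torch.float64, device=dev)
    torch.manual_seed(grid_cols + i)
    b_blk = torch.randn(m_loc, dtype=torch.float64, device=dev)
    torch.manual_seed(grid_cols + grid_rows + rank)
    A_blk = torch.randn(m_loc, n_loc, dtype=torch.float64, device=dev)
    x = torch.zeros(n_loc, dtype=torch.float64, device=dev)
    y = torch.zeros(m_loc, dtype=torch.float64, device=dev)

    for _ in range(100):
        x, y = pdhg_step(A_blk, c_blk, b_blk, x, y, tau=1e-2, sigma=1e-2,
                         row_group=row_groups[i], col_group=col_groups[j])

    dist.destroy_process_group()


if __name__ == "__main__":
    # Launch with e.g.: GRID_ROWS=2 torchrun --nproc_per_node=4 pdhg_2d.py
    main()
```

The sketch illustrates why the 2D grid keeps communication overhead modest: each matrix-vector product needs only one sum-reduction, and only along a single grid dimension (rows for Ax, columns for A^T y), so the per-iteration traffic is limited to the primal and dual slices rather than the full vectors.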