In this work, we present a distributed implementation of the Primal-Dual Hybrid Gradient (PDHG) algorithm for solving massive-scale linear programming (LP) problems. Although PDHG-based solvers have shown strong performance on single-node GPU architectures, their applicability to industrial-scale instances is often limited by GPU memory capacity and computational throughput. To overcome these challenges, we extend the PDHG framework to a distributed-memory setting via a practical two-dimensional grid partitioning of the constraint matrix, enabling scalable execution across multiple GPUs. Our implementation leverages the NCCL communication backend to efficiently synchronize primal-dual updates across devices. To improve load balance and computational efficiency, we introduce a block-wise random shuffling strategy combined with nonzero-aware data distribution, and further accelerate computation through fused CUDA kernels. By distributing both memory and computation, the proposed framework not only overcomes the single-GPU memory bottleneck but also achieves substantial speedups by exploiting multi-GPU parallelism with relatively low communication overhead. Extensive experiments on standard LP benchmarks, including MIPLIB and Hans' instances, as well as large-scale real-world datasets, show that our distributed implementation, built upon cuPDLPx, achieves strong scalability and high performance while preserving full FP64 numerical accuracy.
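To make the 2D partitioning and communication pattern described above concrete, the following is a minimal sketch of one distributed PDHG iteration for an equality-constrained LP (min c^T x s.t. Ax = b, x >= 0) with the constraint matrix split over a 2D process grid. It is an illustration only, not the cuPDLPx implementation or its API: it uses PyTorch's `torch.distributed` with the NCCL backend as a stand-in for raw NCCL calls, dense toy blocks in place of sparse nonzero-aware storage, fixed step sizes, and no block shuffling or fused kernels; all names (`pdhg_step`, `GRID_ROWS`, block sizes) are hypothetical.

```python
# Sketch of one distributed PDHG iteration on a 2D process grid.
# Assumptions: torch.distributed with the NCCL backend stands in for raw NCCL,
# blocks are dense toy data, step sizes are fixed. Not the cuPDLPx interface.
import os
import torch
import torch.distributed as dist


def pdhg_step(A_blk, c_blk, b_blk, x, y, tau, sigma, row_group, col_group):
    """One PDHG iteration for  min c^T x  s.t.  A x = b, x >= 0.
    This rank holds the block A_blk = A[i, j], the primal slice x_j,
    and the dual slice y_i."""
    # A^T y: local partial product, summed over the grid column
    # (every rank holding column block j contributes A[i, j]^T y_i).
    aty = A_blk.t() @ y
    dist.all_reduce(aty, op=dist.ReduceOp.SUM, group=col_group)

    # Primal update, projected onto the nonnegative orthant.
    x_new = torch.clamp(x - tau * (c_blk - aty), min=0.0)

    # A (2 x_new - x): local partial product, summed over the grid row.
    ax = A_blk @ (2.0 * x_new - x)
    dist.all_reduce(ax, op=dist.ReduceOp.SUM, group=row_group)

    # Dual update for the equality constraints A x = b.
    y_new = y + sigma * (ax - b_blk)
    return x_new, y_new


def main():
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    grid_rows = int(os.environ.get("GRID_ROWS", "2"))
    assert world % grid_rows == 0, "world size must be divisible by GRID_ROWS"
    grid_cols = world // grid_rows
    i, j = rank // grid_cols, rank % grid_cols
    torch.cuda.set_device(rank % torch.cuda.device_count())
    dev = torch.device("cuda")

    # One communicator per grid row and per grid column; every rank must
    # call new_group for every group, in the same order.
    row_groups = [dist.new_group([r * grid_cols + c for c in range(grid_cols)])
                  for r in range(grid_rows)]
    col_groups = [dist.new_group([r * grid_cols + c for r in range(grid_rows)])
                  for c in range(grid_cols)]

    # Toy FP64 data, seeded so replicated slices agree across the grid
    # (c_j / x_j shared along a grid column, b_i / y_i shared along a grid row).
    m_loc, n_loc = 1024, 1024
    torch.manual_seed(j)
    c_blk = torch.randn(n_loc, dtype=torch.float64, device=dev)
    torch.manual_seed(grid_cols + i)
    b_blk = torch.randn(m_loc, dtype=torch.float64, device=dev)
    torch.manual_seed(grid_cols + grid_rows + rank)
    A_blk = torch.randn(m_loc, n_loc, dtype=torch.float64, device=dev)
    x = torch.zeros(n_loc, dtype=torch.float64, device=dev)
    y = torch.zeros(m_loc, dtype=torch.float64, device=dev)

    for _ in range(100):
        x, y = pdhg_step(A_blk, c_blk, b_blk, x, y, tau=1e-2, sigma=1e-2,
                         row_group=row_groups[i], col_group=col_groups[j])

    dist.destroy_process_group()


if __name__ == "__main__":
    # Launch with e.g.: GRID_ROWS=2 torchrun --nproc_per_node=4 pdhg_2d.py
    main()
```

The sketch illustrates why the 2D grid keeps communication overhead modest: each matrix-vector product needs only one sum-reduction, and only along a single grid dimension (rows for Ax, columns for A^T y), so the per-iteration traffic is limited to the primal and dual slices rather than the full vectors.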