Large-Scale Regularized Matching on GPU Clusters

Production decision systems such as ad allocation or content matching involve millions of users and thousands of items, reducing to large-scale linear programs with sparse block-diagonal structure across users. These LPs are solved repeatedly on recurring cadences over slowly evolving inputs. Three system gaps stand out. Scale: production instances routinely exceed the memory capacity of GPU solvers such as cuPDLP and D-PDLP under fixed hardware budgets. Temporal instability: solution variability across runs induces downstream churn and complicates SLAs, yet existing solvers provide no explicit control. Extensibility: CPU-based solvers such as DuaLip-Scala converge slowly and couple problem formulation to fixed schemas, making new constraint families difficult to express. We present a distributed multi-GPU LP solver built natively in PyTorch with systems-algorithm co-design for this structure. It adopts column-sharded parallelism with fused Triton kernels and batched operations to reduce per-iteration overhead. As users grow, only local computation increases, while communication is limited to a reduction of item-level dual variables, yielding near-linear scaling with GPU count at fixed item size. We also adopt ridge-regularized LPs to improve stability, a control absent from existing GPU solvers. A continuation schedule over the regularization parameter balances convergence speed and solution fidelity. Finally, we introduce an operator-centric programming model that replaces DuaLip-Scala's schema-bound interface with composable primitives, enabling new formulations without modifying the solve loop or distributed infrastructure. On synthetic workloads, our system achieves order-of-magnitude wall-clock speedup over DuaLip-Scala, near-linear multi-GPU scaling (3.86x on 4 GPUs), and scales beyond the reach of existing GPU solvers.
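To make the communication pattern concrete, the following is a minimal pure-Python sketch, not the system's PyTorch/Triton implementation. It assumes a simplified problem: each user-item allocation lies in a box 0 <= x <= 1 (the abstract does not specify the per-user constraint set), item capacities are the only coupling constraints, and plain projected dual ascent stands in for whatever update rule the solver actually uses. All names (`shard_usage`, `reduced_usage`, `solve`) and the toy data are hypothetical. Users are split into shards, each shard computes its own item-level usage, and the only cross-shard step is a sum over item-length vectors, which plays the role of the reduction of item-level quantities. The outer loop over `gammas` illustrates a continuation schedule on the ridge parameter with warm-started duals.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]  # rows = users, columns = items


def shard_usage(costs: Matrix, lam: Sequence[float], gamma: float) -> List[float]:
    """Item-level usage from one shard of users.

    For the ridge-regularized objective c.x + (gamma/2)|x|^2 with dual prices
    lam on item capacities, the per-entry minimizer under the ASSUMED box
    constraint 0 <= x <= 1 is clip(-(c + lam) / gamma, 0, 1).
    """
    usage = [0.0] * len(lam)
    for row in costs:
        for j, (c, l) in enumerate(zip(row, lam)):
            usage[j] += min(1.0, max(0.0, -(c + l) / gamma))
    return usage


def reduced_usage(shards: Sequence[Matrix], lam: Sequence[float], gamma: float) -> List[float]:
    """Sum per-shard usage vectors; stands in for an all-reduce across GPUs."""
    partials = [shard_usage(s, lam, gamma) for s in shards]
    return [sum(col) for col in zip(*partials)]


def solve(shards: Sequence[Matrix], capacity: Sequence[float],
          gammas: Sequence[float] = (1.0, 0.1), steps: int = 500
          ) -> Tuple[List[float], List[float]]:
    """Projected dual ascent with a decreasing ridge parameter (continuation)."""
    n_users = sum(len(s) for s in shards)
    lam = [0.0] * len(capacity)
    for gamma in gammas:                  # duals are warm-started across stages
        step = gamma / n_users            # 1/L for this toy dual's gradient
        for _ in range(steps):
            usage = reduced_usage(shards, lam, gamma)
            # Raise the price of over-subscribed items; keep prices nonnegative.
            lam = [max(0.0, l + step * (u - b))
                   for l, u, b in zip(lam, usage, capacity)]
    return lam, reduced_usage(shards, lam, gammas[-1])


# Toy instance: 4 users in 2 shards, 2 items, costs are negated values.
shards = [
    [[-1.0, -0.5], [-0.8, -0.2]],
    [[-0.6, -0.9], [-0.4, -0.7]],
]
capacity = [1.0, 1.0]
lam, usage = solve(shards, capacity)
print("item prices:", lam)
print("item usage:", usage)
```

In this sketch, adding users only lengthens the loops inside `shard_usage`, while the vectors exchanged in `reduced_usage` stay at item length. That mirrors the scaling argument in the abstract under the stated assumptions.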
