DuaLip-GPU Technical Report

Large-scale linear programs (LPs) arise in many decision systems, including ranking, allocation, and matching problems that must be solved repeatedly at massive scale. Prior work such as ECLIPSE and LinkedIn's open-source DuaLip showed that ridge-regularized dual ascent with first-order methods can scale to these settings. However, the original implementation was tightly coupled to a small number of schemas and built on a CPU-centric Scala/Spark stack, limiting extensibility and preventing effective use of modern accelerators. We present a redesigned solver architecture that decouples problem specification from the optimization engine and targets GPU execution. The system uses an operator-centric programming model in which LP formulations are expressed through composable primitives for dual objective evaluation and blockwise projection operators for decomposable constraint families. This design allows new formulations to be added locally while reusing a shared optimization loop, diagnostics, and distributed infrastructure. To realize the available parallelism, we develop GPU execution techniques tailored to sparse matching constraints, including constraint-aligned sparse layouts, batched projection kernels, and a distributed design that communicates only dual variables. Further, we improve the underlying ridge-regularized dual ascent method with Jacobi-style row normalization, primal scaling, and a continuation scheme for the regularization parameter. On extreme-scale matching workloads, the GPU implementation achieves at least a 10x wall-clock speedup over the prior distributed CPU DuaLip solver under matched stopping criteria, while maintaining convergence guarantees.
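To make the core method concrete, the following is a minimal NumPy sketch of ridge-regularized dual ascent on a toy LP. It is not the DuaLip-GPU implementation. It assumes a problem of the form min c^T x + (gamma/2)||x||^2 subject to Ax <= b and 0 <= x <= 1, uses plain projected gradient ascent on the dual, and substitutes a simple box projection for the blockwise projection operators described above. The function name `dual_ascent`, its parameters, and the step-size choice are illustrative assumptions; row normalization, primal scaling, and continuation in gamma are omitted.

```python
import numpy as np

def dual_ascent(c, A, b, gamma, num_iters=500):
    """Projected gradient ascent on the dual of a ridge-regularized LP.

    Hypothetical helper for illustration only. Assumed problem:
        min_x  c^T x + (gamma / 2) * ||x||^2
        s.t.   A x <= b,  0 <= x <= 1
    """
    lam = np.zeros(A.shape[0])  # dual variables, one per row of A
    # The dual gradient is Lipschitz with constant ||A||_2^2 / gamma,
    # so the reciprocal is a safe fixed step size (an assumed choice).
    step = gamma / np.linalg.norm(A, 2) ** 2
    for _ in range(num_iters):
        # Primal minimizer of the Lagrangian: project the unconstrained
        # ridge solution onto the simple constraint set (here a box).
        x = np.clip(-(c + A.T @ lam) / gamma, 0.0, 1.0)
        # Gradient of the dual objective is the constraint residual.
        grad = A @ x - b
        # Ascent step, then projection onto lam >= 0.
        lam = np.maximum(lam + step * grad, 0.0)
    # Recover the primal point for the final dual iterate.
    x = np.clip(-(c + A.T @ lam) / gamma, 0.0, 1.0)
    return x, lam

# Toy instance: two items compete for one unit of capacity.
c = np.array([-1.0, -2.0])      # negative values, since we minimize
A = np.array([[1.0, 1.0]])      # single coupling constraint x1 + x2 <= 1
b = np.array([1.0])
x, lam = dual_ascent(c, A, b, gamma=0.1)
print(x, lam)
```

In this sketch only the dual vector `lam` couples the coordinates of `x`; the primal update is an independent projection per coordinate, which is the structure that a batched, blockwise GPU implementation would exploit.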
