Embedding parameterized optimization problems as layers into machine learning architectures serves as a powerful inductive bias. Training such architectures with stochastic gradient descent requires care, as degenerate derivatives of the embedded optimization problem often render the gradients uninformative. We propose Lagrangian Proximal Gradient Descent (LPGD), a flexible framework for training architectures with embedded optimization layers that seamlessly integrates into automatic differentiation libraries. LPGD efficiently computes meaningful replacements of the degenerate optimization layer derivatives by re-running the forward solver oracle on a perturbed input. LPGD captures various previously proposed methods as special cases, while fostering deep links to traditional optimization methods. We theoretically analyze our method and demonstrate on historical and synthetic data that LPGD converges faster than gradient descent even in a differentiable setup.
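To make the perturbed-re-run idea concrete, here is a minimal illustrative sketch (not the paper's exact algorithm) of the blackbox-differentiation-style special case the abstract alludes to: for a layer whose solver output is piecewise constant, the true Jacobian is zero almost everywhere, so a replacement gradient is formed from one extra solver call on a linearly perturbed input. The toy `solver`, the perturbation strength `tau`, and all variable names are assumptions for illustration only.

```python
import numpy as np

def solver(w):
    """Toy combinatorial 'solver': returns argmax(w) as a one-hot vector.

    Its output is piecewise constant in w, so the true Jacobian dY/dw
    is zero almost everywhere -- the degenerate-derivative situation.
    """
    y = np.zeros_like(w)
    y[np.argmax(w)] = 1.0
    return y

def replacement_gradient(w, dl_dy, tau=0.5):
    """Surrogate for dL/dw via one extra forward solve on a perturbed input.

    Perturbing the input against the incoming gradient dL/dy moves the
    solver toward lower-loss outputs; the finite difference of the two
    solutions then serves as an informative replacement gradient.
    """
    y = solver(w)                     # original forward solution
    y_tau = solver(w - tau * dl_dy)   # re-run on perturbed input
    return (y - y_tau) / tau          # finite-difference surrogate

w = np.array([1.0, 0.2, -0.3])
dl_dy = np.array([2.0, -1.0, 0.0])    # incoming gradient from the loss
g = replacement_gradient(w, dl_dy)    # -> [2.0, -2.0, 0.0]
```

In this example, a descent step `w -= lr * g` lowers `w[0]` and raises `w[1]`, moving the argmax toward the lower-loss index even though the exact derivative is zero; the full LPGD framework generalizes this perturbation to the Lagrangian of the embedded optimization problem.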