Constraint-Generation Policy Optimization (CGPO): Nonlinear Programming for Policy Optimization in Mixed Discrete-Continuous MDPs

We propose Constraint-Generation Policy Optimization (CGPO) for optimizing policy parameters within compact and interpretable policy classes for mixed discrete-continuous Markov Decision Processes (DC-MDPs). For many DC-MDPs with expressive nonlinear dynamics, CGPO provides bounded policy error guarantees over an infinite range of initial states, and it can provably derive optimal policies in cases where it terminates with zero error. CGPO can also generate worst-case state trajectories to diagnose policy deficiencies and provide counterfactual explanations of optimal actions. To achieve these results, CGPO formulates policy optimization within defined expressivity classes (i.e., piecewise (non)linear) as a bi-level mixed-integer nonlinear optimization problem and reduces it to an optimal constraint-generation methodology that adversarially generates worst-case state trajectories. By leveraging modern nonlinear optimizers, CGPO can obtain solutions with bounded optimality gap guarantees. We handle stochastic transitions through explicit marginalization (where applicable) or chance constraints, providing high-probability policy performance guarantees. We also present a roadmap for understanding the computational complexity associated with different expressivity classes of policy, reward, and transition dynamics. We experimentally demonstrate the applicability of CGPO in diverse domains, including inventory control, management of a system of water reservoirs, and physics control. In summary, we provide a solution for deriving structured, compact, and explainable policies with bounded performance guarantees, enabling worst-case scenario generation and counterfactual policy diagnostics.
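
The alternation between policy optimization and adversarial worst-case generation described above can be illustrated with a minimal sketch. It is not the CGPO implementation: it assumes a hypothetical single-step, deterministic inventory problem with a constant-order policy `theta`, it measures policy error as regret against the best action in each state, and it replaces the mixed-integer nonlinear solvers with brute-force search over small grids. All names (`DEMAND`, `STATES`, `THETAS`, `regret`, `outer`, `inner`) are illustrative assumptions.

```python
# Toy constraint-generation loop (illustrative assumptions throughout).
DEMAND = 6.0                                # assumed fixed demand
STATES = [0.5 * i for i in range(21)]       # candidate initial stock levels 0..10
THETAS = [0.5 * i for i in range(13)]       # candidate constant order sizes 0..6


def reward(s, a):
    # Penalize the mismatch between stock after ordering and demand.
    return -abs(s + a - DEMAND)


def best_reward(s):
    # With non-negative orders, ordering up to the demand level is optimal.
    return reward(s, max(0.0, DEMAND - s))


def regret(theta, s):
    # Error of the constant-order policy theta in state s.
    return best_reward(s) - reward(s, theta)


def outer(cuts):
    # Outer problem: pick the policy with the smallest worst regret
    # over the states generated so far.
    def worst_on_cuts(theta):
        return max((regret(theta, s) for s in cuts), default=0.0)

    theta = min(THETAS, key=worst_on_cuts)
    return theta, worst_on_cuts(theta)


def inner(theta):
    # Inner (adversarial) problem: find the state where theta does worst.
    s = max(STATES, key=lambda state: regret(theta, state))
    return s, regret(theta, s)


def constraint_generation(tol=1e-9, max_iters=50):
    cuts = []
    for _ in range(max_iters):
        theta, eps = outer(cuts)
        s_worst, violation = inner(theta)
        if violation <= eps + tol:
            # No state violates the current error bound: stop.
            return theta, eps, cuts
        cuts.append(s_worst)                # add the violated constraint
    raise RuntimeError("constraint generation did not converge")


theta, eps, cuts = constraint_generation()
print(theta, eps, cuts)
```

On this toy problem the loop stops after generating two adversarial states (stock levels 0 and 6) and returns `theta = 3.0` with a worst-case regret bound of `3.0` over the state grid. The nonzero bound reflects that a constant-order policy class is too restrictive here, which is the kind of deficiency a worst-case trajectory is meant to expose.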
