There are two primary approaches to solving Markov decision problems (MDPs): dynamic programming based on the Bellman equation, and linear programming (LP). Dynamic programming methods are the most widely used and form the foundation of both classical and modern reinforcement learning (RL). By contrast, LP-based methods have been employed less often, although they have recently gained attention in contexts such as offline RL. Their relative underuse stems from the fact that the LP formulation leads to an inequality-constrained optimization problem, which is generally harder to solve efficiently than Bellman-equation-based formulations. The purpose of this paper is to establish a theoretical foundation for solving LP-based MDPs in a more effective and practical manner. Our key idea is to leverage the log-barrier function, widely used in inequality-constrained optimization, to transform the LP formulation of the MDP into an unconstrained optimization problem. This reformulation enables approximate solutions to be obtained easily via gradient descent. While the method may appear simple, to the best of our knowledge a thorough theoretical interpretation of this approach has not yet been developed; this paper aims to close that gap.
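To make the idea concrete, here is a minimal sketch on a toy MDP of our own construction (not the paper's setting or results). The primal LP for an MDP, min_v mu^T v subject to v(s) >= r(s,a) + gamma * sum_{s'} P(s'|s,a) v(s') for all (s,a), is replaced by the unconstrained log-barrier objective mu^T v - (1/t) * sum_{s,a} log slack(s,a), minimized by gradient descent with backtracking while the barrier weight t is increased. All names, sizes, and step-size choices below are illustrative assumptions.

```python
import numpy as np

# Toy MDP (illustrative assumption, not from the paper): random transitions
# and rewards, discount gamma, positive state weights mu in the LP objective.
rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nA, nS))  # P[a, s, :] = next-state distribution
r = rng.uniform(0.0, 1.0, size=(nA, nS))       # r[a, s] = reward
mu = np.ones(nS) / nS                          # positive weights in the LP objective

def slacks(v):
    # slack[a, s] = v(s) - r(a, s) - gamma * (P_a v)(s); feasibility means all > 0
    return v[None, :] - r - gamma * np.einsum('asx,x->as', P, v)

def f(v, t):
    # log-barrier objective: mu^T v - (1/t) * sum log slack; +inf outside the domain
    s = slacks(v)
    return np.inf if np.any(s <= 0) else mu @ v - np.sum(np.log(s)) / t

def grad(v, t):
    s = slacks(v)
    g = mu.copy()
    for a in range(nA):
        inv = 1.0 / s[a]
        # d slack[a, s] / d v(k) = 1{s == k} - gamma * P[a, s, k]
        g -= (inv - gamma * P[a].T @ inv) / t
    return g

# A large constant value function is strictly feasible.
v = np.full(nS, r.max() / (1.0 - gamma) + 1.0)
t = 1.0
for _ in range(4):                  # barrier continuation: t = 1, 10, 100, 1000
    for _ in range(5000):           # inner loop: plain gradient descent
        g = grad(v, t)
        if np.linalg.norm(g) < 1e-9:
            break
        step = 1.0
        while f(v - step * g, t) > f(v, t):  # backtrack to stay feasible and descend
            step *= 0.5
        v -= step * g
    t *= 10.0

# Reference solution: value iteration on the same MDP (the LP optimum equals v*).
v_star = np.zeros(nS)
for _ in range(2000):
    v_star = np.max(r + gamma * np.einsum('asx,x->as', P, v_star), axis=0)

print("max |v_barrier - v_star| =", np.max(np.abs(v - v_star)))
```

Because every feasible point of this LP dominates the optimal value function componentwise, the barrier iterate approaches v* from above as t grows, with a suboptimality gap on the order of (number of constraints)/t; the sketch only illustrates this mechanism, not the paper's theoretical analysis.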