We study reinforcement learning by combining recent advances in regularized linear programming formulations with the classical theory of stochastic approximation. Motivated by the challenge of designing algorithms that leverage off-policy data while maintaining on-policy exploration, we propose PGDA-RL, a novel primal-dual Projected Gradient Descent-Ascent algorithm for solving regularized Markov Decision Processes (MDPs). PGDA-RL integrates experience-replay-based gradient estimation with a two-timescale decomposition of the underlying nested optimization problem. The algorithm operates asynchronously, interacts with the environment through a single trajectory of correlated data, and updates its policy online in response to the dual variable associated with the occupancy measure of the underlying MDP. We prove that PGDA-RL converges almost surely to the optimal value function and policy of the regularized MDP. Our convergence analysis relies on tools from stochastic approximation theory and holds under weaker assumptions than those required by existing primal-dual RL approaches, notably removing the need for a simulator or a fixed behavioral policy. Under a strengthened ergodicity assumption on the underlying Markov chain, we establish a last-iterate finite-time guarantee with $\tilde{O}(k^{-2/3})$ mean-square convergence, matching the best-known rates for two-timescale stochastic approximation methods under Markovian sampling and biased gradient estimates.
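To make the two-timescale primal-dual scheme concrete, the following is a minimal Python sketch for a tabular entropy-regularized MDP. It is a simplified illustration, not the paper's algorithm: it uses exact gradients of the LP Lagrangian with a known transition kernel `P` and reward `r`, whereas PGDA-RL estimates these gradients from a single trajectory via experience replay. The function names (`pgda_regularized_mdp`, `project_simplex`), the stepsize exponents, the entropy regularizer, and the choice of which variable runs on the fast timescale are all illustrative assumptions.

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(x)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(x + theta, 0.0)

def pgda_regularized_mdp(P, r, gamma, nu, beta, iters=50_000):
    """Deterministic-gradient sketch of projected gradient descent-ascent
    on the Lagrangian of an entropy-regularized discounted-MDP LP:

      L(V, mu) = (1-gamma) <nu, V>
                 + sum_{s,a} mu[s,a] * (r[s,a] + gamma*(P V)[s,a] - V[s])
                 - (1/beta) * sum_{s,a} mu[s,a] * log mu[s,a]

    Primal variable V  (value function): gradient descent, slow timescale.
    Dual variable  mu  (occupancy measure): projected ascent, fast timescale.
    """
    S, A = r.shape
    V = np.zeros(S)
    mu = np.full(S * A, 1.0 / (S * A))  # flattened occupancy measure, sums to 1
    for k in range(1, iters + 1):
        a_k = 1.0 / k**0.6  # slow (primal) stepsize
        b_k = 1.0 / k**0.4  # fast (dual) stepsize; decays more slowly
        M = mu.reshape(S, A)
        # Bellman residual r + gamma * P V - V, shape (S, A)
        delta = r + gamma * P.dot(V) - V[:, None]
        # projected ascent on mu; entropy term keeps mu in the interior
        grad_mu = delta.ravel() - (np.log(mu + 1e-12) + 1.0) / beta
        mu = project_simplex(mu + b_k * grad_mu)
        # descent on V: grad_V(s') = (1-gamma)*nu(s')
        #   + gamma * sum_{s,a} mu(s,a) P(s'|s,a) - sum_a mu(s',a)
        grad_V = (1 - gamma) * nu + gamma * np.einsum('sap,sa->p', P, M) - M.sum(axis=1)
        V = V - a_k * grad_V
    policy = mu.reshape(S, A)
    policy = policy / np.maximum(policy.sum(axis=1, keepdims=True), 1e-12)
    return V, policy

if __name__ == "__main__":
    # usage on a random 4-state, 2-action MDP
    rng = np.random.default_rng(0)
    S, A = 4, 2
    P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
    r = rng.random((S, A))
    nu = np.full(S, 1.0 / S)  # initial-state distribution
    V, pi = pgda_regularized_mdp(P, r, gamma=0.9, nu=nu, beta=10.0)
    print("V:", V.round(3)); print("policy:", pi.round(3))
```

The dual stepsize decays more slowly than the primal one, so the occupancy-measure variable equilibrates on the fast timescale while the value function drifts slowly; the greedy policy is then read off by normalizing the dual variable over actions, mirroring the abstract's description of the policy being updated online in response to the dual variable.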