We study average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy optimization and policy evaluation. In contrast to the intensive research effort devoted to the finite-sample analysis of policy gradient methods for discounted MDPs, existing studies of policy gradient methods for AMDPs mostly focus on regret bounds under restrictive assumptions and often lack guarantees on the overall sample complexity. To this end, we develop an average-reward stochastic policy mirror descent (SPMD) method for solving AMDPs with and without regularizers and provide convergence guarantees in terms of the long-term average reward. For policy evaluation, existing on-policy methods suffer from sub-optimal convergence rates and fail to handle insufficiently random policies due to a lack of exploration in the action space. To remedy these issues, we develop a variance-reduced temporal difference (VRTD) method with linear function approximation for randomized policies, along with optimal convergence guarantees, and design an exploratory VRTD method that resolves the exploration issue and provides comparable convergence guarantees. By combining the policy evaluation and policy optimization components, we establish sample complexity results for solving AMDPs under both generative and Markovian noise models. It is worth noting that when linear function approximation is utilized, our algorithm needs to update only in a low-dimensional parameter space and can thus handle MDPs with large state and action spaces.
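To make the policy-evaluation ingredient concrete, the following is a minimal sketch of a generic SVRG-style variance-reduced TD(0) update with linear function approximation in the average-reward setting. It is not the paper's VRTD algorithm; the function names, the batched-anchor scheme, and all parameters (`alpha`, `beta`, `epoch_len`, the `transitions` buffer) are illustrative assumptions used only to show the general technique of correcting a stochastic TD semi-gradient with a full-batch anchor gradient.

```python
# Illustrative sketch only: SVRG-style variance-reduced average-reward TD(0)
# with linear features. Not the paper's algorithm; all names are hypothetical.
import numpy as np

def td_semigradient(theta, eta, phi_s, reward, phi_next):
    """Average-reward TD(0) semi-gradient for one transition (s, r, s')."""
    delta = reward - eta + phi_next @ theta - phi_s @ theta
    return -delta * phi_s  # negated so a descent step reduces the TD error

def vr_td(transitions, feat_dim, alpha=0.1, beta=0.1, epochs=10, epoch_len=100, rng=None):
    """Variance-reduced TD with linear features.

    `transitions` is a list of (phi_s, reward, phi_next) tuples collected under a
    fixed behavior policy; a full pass over it serves as the anchor batch.
    """
    rng = rng or np.random.default_rng(0)
    theta, eta = np.zeros(feat_dim), 0.0
    for _ in range(epochs):
        # Anchor point: full-batch semi-gradient at the current iterate.
        theta_tilde = theta.copy()
        g_bar = np.mean([td_semigradient(theta_tilde, eta, *t) for t in transitions], axis=0)
        for _ in range(epoch_len):
            phi_s, r, phi_next = transitions[rng.integers(len(transitions))]
            # Variance-reduced correction: stochastic semi-gradient at theta,
            # minus the same sample's semi-gradient at the anchor, plus the
            # anchor's full-batch semi-gradient.
            g = (td_semigradient(theta, eta, phi_s, r, phi_next)
                 - td_semigradient(theta_tilde, eta, phi_s, r, phi_next) + g_bar)
            theta -= alpha * g
            eta += beta * (r - eta)  # running estimate of the long-run average reward
    return theta, eta
```

The sketch updates only the `feat_dim`-dimensional parameter vector `theta` and the scalar average-reward estimate `eta`, which illustrates the point made above: with linear function approximation the iteration cost is independent of the sizes of the state and action spaces.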