In this paper, we consider an infinite-horizon average-reward Markov Decision Process (MDP). Unlike existing works in this setting, our approach uses a general parameterized policy gradient-based algorithm and does not assume a linear MDP structure. We propose a policy gradient-based algorithm and establish its global convergence. We then prove that the proposed algorithm achieves $\tilde{\mathcal{O}}({T}^{3/4})$ regret. This paper is the first to study regret bounds for a general parameterized policy gradient algorithm in the average-reward setting.