In this paper, we consider an infinite-horizon average-reward Markov Decision Process (MDP). Unlike existing works in this setting, our approach uses a general parameterized policy gradient algorithm and does not assume a linear MDP structure. We propose a policy gradient-based algorithm and establish its global convergence. We then prove that the proposed algorithm achieves $\tilde{\mathcal{O}}(T^{3/4})$ regret. This paper presents the first regret-bound analysis of a general parameterized policy gradient algorithm in the average-reward setting.