We present two policy gradient-based methods with general parameterization for infinite-horizon average-reward Markov Decision Processes. The first approach employs Implicit Gradient Transport for variance reduction and ensures an expected regret of order $\tilde{\mathcal{O}}(T^{3/5})$. The second approach, rooted in Hessian-based techniques, ensures an expected regret of order $\tilde{\mathcal{O}}(\sqrt{T})$. Both results significantly improve on the previous state of the art for this problem, which achieves a regret of $\tilde{\mathcal{O}}(T^{3/4})$.
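To give a rough sense of how far apart the three regret orders are, the following minimal sketch compares their polynomial parts for a few horizons $T$. It is an illustration only: the constants and logarithmic factors hidden by the $\tilde{\mathcal{O}}$ notation are ignored, and the names `RATES`, `regret_scaling`, and `average_regret_scaling` are hypothetical helpers introduced here, not part of the methods themselves.

```python
# Illustrative only: constants and logarithmic factors hidden by the
# tilde-O notation are ignored, so these values show scaling trends,
# not actual regret.
RATES = {
    "prior state of the art, T^(3/4)": 0.75,
    "first method (Implicit Gradient Transport), T^(3/5)": 0.6,
    "second method (Hessian-based), T^(1/2)": 0.5,
}


def regret_scaling(horizon, exponent):
    """Return T**exponent, the polynomial part of a regret bound."""
    return float(horizon) ** exponent


def average_regret_scaling(horizon, exponent):
    """Return T**(exponent - 1), the per-step regret implied by the bound."""
    return regret_scaling(horizon, exponent) / horizon


if __name__ == "__main__":
    for horizon in (10**4, 10**6, 10**8):
        print(f"T = {horizon:.0e}")
        for label, exponent in RATES.items():
            total = regret_scaling(horizon, exponent)
            per_step = average_regret_scaling(horizon, exponent)
            print(f"  {label}: total ~ {total:.3g}, per step ~ {per_step:.3g}")
```

Because every exponent is below 1, the per-step quantity $T^{\text{exponent}-1}$ shrinks as $T$ grows in all three cases; a smaller exponent simply means it shrinks faster.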