Policy Mirror Descent (PMD) is a powerful and theoretically sound methodology for sequential decision-making. However, it is not directly applicable to Reinforcement Learning (RL) settings where explicit action-value functions are inaccessible. We address this challenge by introducing a novel approach that learns a world model of the environment using conditional mean embeddings. Leveraging tools from operator theory, we derive a closed-form expression for the action-value function in terms of the world model via simple matrix operations. Combining these estimators with PMD yields POWR, a new RL algorithm for which we prove convergence rates to the global optimum. Preliminary experiments in finite and infinite state settings support the effectiveness of our method.
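To make the pipeline concrete, below is a minimal tabular sketch of the two ingredients the abstract names: a closed-form action-value function computed from an (estimated) world model via matrix operations, and a PMD update with the KL mirror map. This is an illustrative analogue, not the paper's POWR implementation: the actual method uses conditional mean embeddings to handle general state spaces, whereas here we assume a finite MDP with a transition matrix, and all function names (`q_from_model`, `pmd_step`) and the toy setup are hypothetical.

```python
import numpy as np

def q_from_model(P, r, pi, gamma):
    """Closed-form Q^pi from a (learned) world model via matrix operations.

    P:     (S*A, S) transition matrix, rows ordered by (state, action).
    r:     (S*A,)   reward vector.
    pi:    (S, A)   stochastic policy.
    gamma: discount factor in [0, 1).
    """
    S, A = pi.shape
    # State-to-state kernel under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    P_pi = (pi.reshape(S * A, 1) * P).reshape(S, A, S).sum(axis=1)
    r_pi = (pi * r.reshape(S, A)).sum(axis=1)
    # V^pi solves the Bellman linear system (I - gamma * P_pi) V = r_pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    # Q^pi(s, a) = r(s, a) + gamma * E_{s' ~ P(.|s,a)}[V(s')]
    return (r + gamma * P @ V).reshape(S, A)

def pmd_step(pi, Q, eta):
    """One PMD update with the KL mirror map: pi' ∝ pi * exp(eta * Q)."""
    logits = np.log(pi + 1e-12) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

# Toy usage (hypothetical): random MDP standing in for the learned model,
# uniform initial policy, a few PMD iterations.
rng = np.random.default_rng(0)
S, A, gamma, eta = 5, 3, 0.9, 1.0
P = rng.dirichlet(np.ones(S), size=S * A)  # stand-in for the world model
r = rng.uniform(size=S * A)
pi = np.full((S, A), 1.0 / A)
for _ in range(50):
    pi = pmd_step(pi, q_from_model(P, r, pi, gamma), eta)
```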