Policy Mirror Descent (PMD) is a popular framework in reinforcement learning, serving as a unifying perspective that encompasses numerous algorithms. These algorithms are derived through the selection of a mirror map and enjoy finite-time convergence guarantees. Despite its popularity, the exploration of PMD's full potential is limited, with the majority of research focusing on a particular mirror map -- namely, the negative entropy -- which gives rise to the renowned Natural Policy Gradient (NPG) method. It remains unclear from existing theoretical studies whether the choice of mirror map significantly influences PMD's efficacy. In our work, we conduct empirical investigations showing that the conventional mirror map choice (NPG) often yields suboptimal outcomes across several standard benchmark environments. Using evolutionary strategies, we identify more efficient mirror maps that enhance the performance of PMD. We first focus on a tabular environment, i.e., Grid-World, where we relate existing theoretical bounds to the performance of PMD for a few standard mirror maps and the learned one. We then show that it is possible to learn a mirror map that outperforms the negative entropy in more complex environments, such as the MinAtar suite. Additionally, we demonstrate that the learned mirror maps generalize effectively to different tasks by testing each map across various other environments.
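For context, a minimal sketch of the PMD update in standard notation (our notation, not taken verbatim from this abstract) is the Bregman proximal step
\[
\pi^{t+1}_s \in \arg\max_{p \in \Delta(\mathcal{A})} \Big\{ \eta \, \big\langle Q^{\pi^t}_s,\, p \big\rangle - D_h\big(p, \pi^t_s\big) \Big\},
\qquad
D_h(p, q) = h(p) - h(q) - \big\langle \nabla h(q),\, p - q \big\rangle,
\]
where $h$ is the mirror map, $D_h$ its Bregman divergence, and $\eta$ a step size. Choosing $h$ as the negative entropy, $h(p) = \sum_{a} p(a) \log p(a)$, makes $D_h$ the KL divergence and yields the multiplicative-weights update associated with NPG,
\[
\pi^{t+1}(a \mid s) \;\propto\; \pi^{t}(a \mid s)\, \exp\!\big(\eta\, Q^{\pi^t}(s, a)\big),
\]
while other choices of $h$ induce different update geometries, which is the degree of freedom explored in this work.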