Natural gradients have long been studied in deep reinforcement learning for their fast convergence and covariant weight updates. However, computing natural gradients requires inverting the Fisher Information Matrix (FIM) at every iteration, which is computationally prohibitive for large models. In this paper, we present an efficient and scalable natural policy optimization technique that leverages a rank-1 approximation to the full inverse FIM. We show theoretically that, under certain conditions, this rank-1 approximation to the inverse FIM converges faster than vanilla policy gradients and, under further conditions, matches the sample complexity of stochastic policy gradient methods. We benchmark our method on a diverse set of environments and show that it outperforms standard actor-critic and trust-region baselines.
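For intuition, the computational appeal of a rank-1 structure can be sketched as follows. This is a minimal illustration, not the paper's actual estimator: we assume (hypothetically) the FIM is approximated by a damped outer product of the score vector, `F ≈ λI + g gᵀ`, in which case the Sherman–Morrison identity yields the inverse-FIM/gradient product in O(d) time instead of the O(d³) cost of an explicit inversion:

```python
import numpy as np

def rank1_natural_gradient(g, lam=1e-2):
    """Apply (lam*I + g g^T)^{-1} to g via Sherman-Morrison in O(d).

    Hypothetical stand-in for a rank-1 inverse-FIM approximation: the FIM
    is modeled as a damped outer product of the score vector g. For this
    special case the Sherman-Morrison formula collapses to a scalar rescaling:
        (lam*I + g g^T)^{-1} g = g / (lam + ||g||^2)
    """
    return g / (lam + g @ g)

rng = np.random.default_rng(0)
d = 5
g = rng.standard_normal(d)   # stand-in for a policy-gradient estimate
lam = 1e-2                   # damping, keeps the rank-1 FIM invertible

# Sanity check: exact solve against the explicit d x d matrix.
F = lam * np.eye(d) + np.outer(g, g)
exact = np.linalg.solve(F, g)
fast = rank1_natural_gradient(g, lam)
print(np.allclose(exact, fast))  # prints True
```

The point of the sketch is only the cost structure: the closed-form update never materializes the d×d matrix, which is what makes a rank-1 scheme scale to large policy networks.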