Natural gradients have long been studied in deep reinforcement learning for their fast convergence properties and covariant weight updates. However, computing natural gradients requires inverting the Fisher Information Matrix (FIM) at every iteration, which is computationally prohibitive at scale. In this paper, we present an efficient and scalable natural policy optimization technique that replaces the full inverse FIM with a rank-1 approximation. We show theoretically that, under certain conditions, this rank-1 approximation to the inverse FIM converges faster than policy gradients and, under additional conditions, enjoys the same sample complexity as stochastic policy gradient methods. We benchmark our method on a diverse set of environments and show that it outperforms standard actor-critic and trust-region baselines.
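To make the computational argument concrete, the sketch below shows one standard way a rank-1 structure removes the explicit inversion: if the damped empirical FIM is modeled as λI + gg^T (a rank-1 update of a scaled identity), the Sherman-Morrison identity collapses the inverse-FIM-vector product to a closed form costing O(d) rather than O(d^3). This is a minimal illustrative stand-in, not necessarily the paper's exact construction; the function name `rank1_natural_gradient` and the `damping` parameter are hypothetical.

```python
import numpy as np

def rank1_natural_gradient(g, damping=1e-3):
    """Approximate natural-gradient direction F^{-1} g, modeling the
    damped FIM by its rank-1 empirical estimate F ~= lambda*I + g g^T.

    By the Sherman-Morrison identity,
        (lambda*I + g g^T)^{-1} g = g / (lambda + g^T g),
    so the inverse is never formed explicitly: the natural-gradient
    step reduces to a rescaling of the ordinary gradient, O(d) work
    instead of the O(d^3) cost of a full matrix inversion.

    Note: names and the rank-1 FIM model here are illustrative
    assumptions, not the method described in the paper.
    """
    return g / (damping + g @ g)

# Usage: rescale a policy gradient before the parameter update.
rng = np.random.default_rng(0)
g = rng.standard_normal(1_000_000)   # stand-in policy gradient, d = 1e6
theta = np.zeros_like(g)             # policy parameters
theta += 1e-2 * rank1_natural_gradient(g)  # natural-gradient ascent step
```

Because the step is a scalar rescaling of g, this kind of approximation preserves the gradient's direction per update while adapting its magnitude to the local curvature estimate, which is consistent with the claim that it can match the sample complexity of stochastic policy gradient methods.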