Natural gradients have long been studied in deep reinforcement learning for their fast convergence and covariant weight updates. However, computing the natural gradient requires inverting the Fisher Information Matrix (FIM) at every iteration, which is computationally prohibitive at scale. In this paper, we present an efficient and scalable natural policy optimization technique that replaces the full inverse FIM with a rank-1 approximation. We show theoretically that, under suitable conditions, this rank-1 approximation converges faster than vanilla policy gradients and, under additional conditions, matches the sample complexity of stochastic policy gradient methods. We benchmark our method on a diverse set of environments and show that it outperforms standard actor-critic and trust-region baselines.
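To make the idea concrete, here is a minimal sketch of one way a rank-1 inverse-FIM approximation can be applied to a gradient. The specific form used here, a damped identity plus a rank-1 outer product of a score vector `u`, inverted in closed form via the Sherman-Morrison identity, is an illustrative assumption, not the paper's exact construction; the function and parameter names (`natural_gradient_rank1`, `u`, `damping`) are hypothetical.

```python
import numpy as np

def natural_gradient_rank1(grad, u, damping=1e-3):
    """Approximate the natural gradient F^{-1} grad, where the FIM is
    modeled as a damped identity plus a rank-1 term:
        F ≈ damping * I + u u^T
    (e.g., u could be a sampled score vector ∇ log π).  Sherman-Morrison
    gives the inverse in closed form, so each step costs O(d) instead of
    the O(d^3) of an explicit matrix inversion."""
    lam = damping
    # (λI + uu^T)^{-1} = (1/λ) I - uu^T / (λ (λ + u^T u))
    coeff = (u @ grad) / (lam * (lam + u @ u))
    return grad / lam - coeff * u

# Usage: precondition a policy gradient with the rank-1 inverse-FIM estimate.
grad = np.random.randn(1000)   # stochastic policy gradient (placeholder)
u = np.random.randn(1000)      # sampled score vector (placeholder)
nat_grad = natural_gradient_rank1(grad, u)
```

Because the inverse never materializes a d×d matrix, this kind of approximation is what makes natural-gradient-style updates tractable for large policy networks.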