Trust Region Policy Optimization (TRPO) is an iterative method that, in each iteration, maximizes a surrogate objective while enforcing a trust region constraint between consecutive policies. This combination has been shown to be crucial for guaranteeing monotonic policy improvement. However, solving the trust-region-constrained optimization problem can be computationally intensive, as it requires many conjugate gradient steps and a large number of on-policy samples. In this paper, we show that the trust region constraint over policies can be safely replaced by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee. The key idea is to generalize the surrogate objective used in TRPO so that a monotonic improvement guarantee still follows from constraining the maximum advantage-weighted ratio between policies. This new constraint outlines a conservative mechanism for iterative policy optimization and sheds light on practical ways to optimize the generalized surrogate objective. We show that, in practice, the new constraint can be effectively enforced by optimizing the generalized objective conservatively. We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is free of any explicit trust region constraint. Empirical results show that TREFree outperforms TRPO and Proximal Policy Optimization (PPO) in terms of policy performance and sample efficiency.
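To make the quantities named in the abstract concrete, the following is a minimal sketch, not the algorithm from the paper. It computes three sample-based quantities from log-probabilities and advantage estimates: the standard importance-sampled surrogate objective used by TRPO, a sample estimate of the KL divergence that a trust region would bound, and one possible reading of the "maximum advantage-weighted ratio". The abstract does not define that last quantity precisely, so taking it as the largest absolute value of ratio times advantage is an assumption, as are the function name `surrogate_terms` and its arguments.

```python
import numpy as np


def surrogate_terms(logp_new, logp_old, advantages):
    """Sample-based quantities for one policy update (illustrative only).

    logp_new, logp_old: log-probabilities of the sampled actions under the
        candidate policy and the data-collecting policy.
    advantages: advantage estimates for the same samples.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Importance ratio pi_new(a|s) / pi_old(a|s) for each sample.
    ratio = np.exp(logp_new - logp_old)
    weighted = ratio * advantages

    return {
        # TRPO-style surrogate: mean of ratio * advantage.
        "surrogate": float(weighted.mean()),
        # ASSUMPTION: "maximum advantage-weighted ratio" is read here as
        # max |ratio * advantage|; the paper's definition may differ.
        "max_weighted_ratio": float(np.abs(weighted).max()),
        # Sample estimate of KL(pi_old || pi_new), the quantity a trust
        # region constraint would bound.
        "approx_kl": float((logp_old - logp_new).mean()),
    }
```

When the candidate policy equals the old policy, every ratio is 1, so the surrogate reduces to the mean advantage and the KL estimate is zero. A conservative optimizer in the spirit described above would presumably monitor the second quantity rather than the third, but how TREFree actually enforces its constraint is not specified in this abstract.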