Many policy optimization approaches in reinforcement learning incorporate a Kullback-Leibler (KL) divergence to the previous policy, to prevent the policy from changing too quickly. This idea was initially proposed in a seminal paper on Conservative Policy Iteration, with approximations given by algorithms like TRPO and Munchausen Value Iteration (MVI). We continue this line of work by investigating a generalized KL divergence, called the Tsallis KL divergence, which uses the $q$-logarithm in its definition. The approach is a strict generalization, as $q = 1$ corresponds to the standard KL divergence; $q > 1$ provides a range of new options. We characterize the types of policies learned under the Tsallis KL, and motivate when $q > 1$ could be beneficial. To obtain a practical algorithm that incorporates Tsallis KL regularization, we extend MVI, which is one of the simplest approaches to incorporate KL regularization. We show that this generalized MVI($q$) obtains significant improvements over the standard MVI($q = 1$) across 35 Atari games.
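
To make the role of $q$ concrete, the following is a minimal numerical sketch, not the paper's implementation. It assumes one common convention for the $q$-logarithm, $\ln_q x = (x^{q-1} - 1)/(q - 1)$, and the form $D_q(\pi \| \mu) = \sum_a \pi(a) \ln_q(\pi(a)/\mu(a))$ for the Tsallis KL divergence; the abstract does not state either formula, and other texts use a $1 - q$ convention instead. The function names `q_log` and `tsallis_kl` are hypothetical.

```python
import numpy as np


def q_log(x, q):
    """q-logarithm under the ASSUMED convention (x**(q-1) - 1) / (q - 1)."""
    x = np.asarray(x, dtype=float)
    if np.isclose(q, 1.0):
        # The limit q -> 1 recovers the natural logarithm.
        return np.log(x)
    return (x ** (q - 1.0) - 1.0) / (q - 1.0)


def tsallis_kl(pi, mu, q):
    """ASSUMED form: D_q(pi || mu) = sum_a pi(a) * ln_q(pi(a) / mu(a)).

    Requires mu(a) > 0 wherever pi(a) > 0.
    """
    pi = np.asarray(pi, dtype=float)
    mu = np.asarray(mu, dtype=float)
    support = pi > 0  # actions with pi(a) = 0 contribute nothing to the sum
    ratio = pi[support] / mu[support]
    return float(np.sum(pi[support] * q_log(ratio, q)))


# Two discrete policies over the same two actions (illustrative values only).
pi = np.array([0.5, 0.5])
mu = np.array([0.25, 0.75])

standard_kl = float(np.sum(pi * np.log(pi / mu)))
kl_q1 = tsallis_kl(pi, mu, q=1.0)   # coincides with the standard KL
kl_q2 = tsallis_kl(pi, mu, q=2.0)   # equals sum(pi**2 / mu) - 1 under this convention
```

Under these assumptions, $q = 1$ reproduces the standard KL divergence exactly, while $q = 2$ reduces to $\sum_a \pi(a)^2/\mu(a) - 1$, which penalizes large ratios $\pi(a)/\mu(a)$ polynomially rather than logarithmically.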