The maximum entropy (MaxEnt) reinforcement learning (RL) framework, often touted for its exploration and robustness capabilities, is usually motivated from a probabilistic perspective, yet deep probabilistic models have not gained much traction in practice due to their inherent complexity. In this work, we propose the adoption of latent variable policies within the MaxEnt framework, which we show can provably approximate any policy distribution and which, additionally, naturally emerge under the use of world models with a latent belief state. We discuss why latent variable policies are difficult to train and how naive approaches can fail, then introduce a series of improvements centered around low-cost marginalization of the latent state, allowing us to make full use of the latent state at minimal additional cost. We instantiate our method under the actor-critic framework, marginalizing both the actor and critic. The resulting algorithm, referred to as Stochastic Marginal Actor-Critic (SMAC), is simple yet effective. We experimentally validate our method on continuous control tasks, showing that effective marginalization can lead to better exploration and more robust training. Our implementation is open-sourced at https://github.com/zdhNarsil/Stochastic-Marginal-Actor-Critic.
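To make the idea of marginalizing a latent variable policy concrete, the following is a minimal sketch of a generic Monte Carlo estimate of log π(a|s) = log E_z[π(a|s, z)]. It is not the SMAC estimator or the authors' implementation. The toy model (a standard normal latent prior, a conditional mean of `state + z`, and a fixed `action_std`) and all function names are assumptions chosen so that the exact marginal is known in closed form and the estimate can be checked against it.

```python
import math
import random


def gaussian_log_pdf(x, mean, std):
    """Log-density of a univariate Gaussian N(mean, std^2) evaluated at x."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(std)
            - 0.5 * ((x - mean) / std) ** 2)


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def estimate_log_marginal(action, state, num_samples, rng, action_std=0.5):
    """Monte Carlo estimate of log pi(a|s) with latent z ~ N(0, 1)."""
    log_probs = []
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)   # hypothetical latent prior (assumption)
        mean = state + z          # hypothetical conditional mean (assumption)
        log_probs.append(gaussian_log_pdf(action, mean, action_std))
    # Average the conditional densities in probability space, done in log space.
    return log_mean_exp(log_probs)


def exact_log_marginal(action, state, action_std=0.5):
    """Closed form for the toy model: a | s ~ N(state, action_std^2 + 1)."""
    return gaussian_log_pdf(action, state, math.sqrt(action_std ** 2 + 1.0))


rng = random.Random(0)
estimate = estimate_log_marginal(0.3, 0.0, 20000, rng)
exact = exact_log_marginal(0.3, 0.0)
print(f"estimate={estimate:.4f} exact={exact:.4f}")
```

In this toy setting the estimate approaches the closed-form value as `num_samples` grows. With a learned policy network no closed form is available, which is why the cost of each additional latent sample matters.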