Efficient exploration in complex environments remains a major challenge for reinforcement learning (RL). In contrast to previous Thompson sampling-inspired mechanisms that enable temporally extended exploration, i.e., deep exploration, we focus on deep exploration in distributional RL. We develop a general-purpose approach, Bag of Policies (BoP), that can be built on top of any return distribution estimator by maintaining a population of its copies. BoP consists of an ensemble of multiple heads that are updated independently. During training, each episode is controlled by only one of the heads, and the collected state-action pairs are used to update all heads off-policy, leading to distinct learning signals for each head that diversify learning and behaviour. To test whether an optimistic ensemble method can improve distributional RL as it did scalar RL (e.g., Bootstrapped DQN), we implement the BoP approach with a population of distributional actor-critics using Bayesian Distributional Policy Gradients (BDPG). The population thus approximates a posterior distribution of return distributions along with a posterior distribution of policies. Another benefit of building upon BDPG is that it allows global posterior uncertainty and a local curiosity bonus to be analyzed simultaneously for exploration. As BDPG is already an optimistic method, this pairing helps to investigate whether optimism can be accumulated in distributional RL. Overall, BoP results in greater robustness and speed during learning, as demonstrated by our experimental results on ALE Atari games.
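The training scheme described above (one head controls each episode, every head learns off-policy from the same transitions) can be illustrated with a minimal sketch. This is not the BoP/BDPG implementation: it swaps in tabular scalar Q-learning on a toy chain environment in place of distributional actor-critics, and omits the curiosity bonus. The environment and all names (`train_bag`, `greedy_actions`, `num_heads`) are hypothetical and chosen only for illustration.

```python
import random


def train_bag(num_heads=3, num_states=5, num_episodes=200,
              alpha=0.5, gamma=0.9, seed=0):
    """Toy ensemble: tabular Q-learning heads on a deterministic chain.

    Assumption: scalar Q-tables stand in for the return distribution
    estimators; the chain environment is invented for this sketch.
    """
    rng = random.Random(seed)
    # One Q-table per head: q[head][state][action]; action 0 = left, 1 = right.
    # Different random initialisations give the heads different behaviour.
    q = [[[rng.uniform(0.0, 0.1) for _ in range(2)]
          for _ in range(num_states)]
         for _ in range(num_heads)]
    controllers = []
    for _ in range(num_episodes):
        # A single head, drawn uniformly, controls the whole episode.
        k = rng.randrange(num_heads)
        controllers.append(k)
        s = 0
        for _ in range(4 * num_states):  # step cap per episode
            # The controlling head acts greedily on its own estimates.
            a = 0 if q[k][s][0] > q[k][s][1] else 1
            s_next = max(s - 1, 0) if a == 0 else s + 1
            done = s_next == num_states - 1  # rightmost state is terminal
            r = 1.0 if done else 0.0
            # Every head learns from the same transition, but each one
            # bootstraps from its own table, so the TD errors differ.
            for h in range(num_heads):
                target = r if done else r + gamma * max(q[h][s_next])
                q[h][s][a] += alpha * (target - q[h][s][a])
            s = s_next
            if done:
                break
    return q, controllers


def greedy_actions(q_head):
    """Greedy action of one head in each non-terminal state."""
    return [0 if row[0] > row[1] else 1 for row in q_head[:-1]]


if __name__ == "__main__":
    q, controllers = train_bag()
    for h, table in enumerate(q):
        print("head", h, "greedy actions:", greedy_actions(table))
```

In this sketch the shared transition is the only thing the heads have in common; the per-head bootstrap target is what keeps their learning signals distinct.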