The combination of deep reinforcement learning (DRL) with ensemble methods has proven highly effective for complex sequential decision-making problems. This success is primarily attributed to the use of multiple models, which improves both the robustness of the policy and the accuracy of value function estimation. However, the empirical success of current ensemble RL methods has so far received little analysis. Our analysis reveals that the sample efficiency of previous ensemble DRL algorithms may be limited by insufficiently diverse sub-policies. Motivated by these findings, we introduce a new ensemble RL algorithm, termed \textbf{T}rajectories-awar\textbf{E} \textbf{E}nsemble exploratio\textbf{N} (TEEN). The primary goal of TEEN is to maximize the expected return while promoting more diverse trajectories. Through extensive experiments, we demonstrate that TEEN not only increases the sample diversity of the ensemble policy relative to using sub-policies alone but also improves performance over existing ensemble RL algorithms. On average, TEEN outperforms the baseline ensemble DRL algorithms by 41\% on the representative environments tested.