Exploration remains a critical issue in deep reinforcement learning for an agent to attain high returns in unknown environments. Although the prevailing Random Network Distillation (RND) exploration algorithm has been shown to be effective in numerous environments, it often lacks the discriminative power needed for precise bonus allocation. This paper highlights the "bonus inconsistency" issue within RND, identifying it as the algorithm's primary limitation. To address this issue, we introduce Distributional RND (DRND), a derivative of RND. DRND enhances the exploration process by distilling a distribution of random target networks and implicitly incorporating pseudo counts to improve the precision of bonus allocation. This refinement encourages agents to engage in more extensive exploration. Our method effectively mitigates the inconsistency issue without introducing significant computational overhead. Both theoretical analysis and experimental results demonstrate the superiority of our approach over the original RND algorithm. DRND excels in challenging online exploration scenarios and also serves effectively as an anti-exploration mechanism in D4RL offline tasks. Our code is publicly available at https://github.com/yk7333/DRND.
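To make the abstract's mechanism concrete, below is a minimal sketch of a DRND-style bonus in PyTorch: a predictor distills a distribution of frozen random target networks, and the bonus combines the distillation error with a pseudo-count-like term derived from the first and second moments of the target outputs. The class and function names (`DRNDBonus`, `make_net`), network sizes, number of targets, the mixing weight `alpha`, and the exact form of the moment-based term are illustrative assumptions, not the authors' implementation; consult the repository above for the actual code.

```python
import torch
import torch.nn as nn

def make_net(in_dim, out_dim):
    # Small MLP; architecture is an illustrative assumption.
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

class DRNDBonus:
    def __init__(self, obs_dim, feat_dim=64, n_targets=10, alpha=0.9):
        # A *distribution* of randomly initialized, frozen target networks.
        self.targets = [make_net(obs_dim, feat_dim) for _ in range(n_targets)]
        for t in self.targets:
            for p in t.parameters():
                p.requires_grad_(False)
        self.predictor = make_net(obs_dim, feat_dim)
        self.alpha = alpha

    def bonus(self, obs):
        with torch.no_grad():
            outs = torch.stack([t(obs) for t in self.targets])  # (N, B, D)
            mu = outs.mean(0)           # first moment of target outputs
            b2 = (outs ** 2).mean(0)    # second moment of target outputs
        pred = self.predictor(obs)
        # Term 1: distillation error toward the mean of the target distribution.
        err = ((pred - mu) ** 2).sum(-1)
        # Term 2: a moment-based statistic acting as an implicit pseudo count --
        # small for frequently visited states, large for novel ones (sketch only).
        ratio = ((pred ** 2 - mu ** 2).abs() / (b2 - mu ** 2 + 1e-8)).clamp(min=1e-8)
        count_term = ratio.sqrt().sum(-1)
        return self.alpha * err + (1 - self.alpha) * count_term

    def update(self, obs, optimizer):
        # Distillation loss: regress the predictor toward one randomly
        # sampled target per update (a simple choice for this sketch).
        idx = torch.randint(len(self.targets), (1,)).item()
        with torch.no_grad():
            tgt = self.targets[idx](obs)
        loss = ((self.predictor(obs) - tgt) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
```

In use, the predictor would be trained on visited states, e.g. `opt = torch.optim.Adam(drnd.predictor.parameters(), lr=1e-4)` followed by `drnd.update(obs_batch, opt)`, while `drnd.bonus(obs_batch)` supplies the intrinsic reward (online exploration) or its negation (offline anti-exploration).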