Exploration remains a critical challenge in deep reinforcement learning: an agent must explore effectively to attain high returns in unknown environments. Although the prevailing exploration algorithm, Random Network Distillation (RND), has been shown to be effective in numerous environments, it often lacks discriminative power in bonus allocation. This paper highlights the "bonus inconsistency" issue within RND, pinpointing it as RND's primary limitation. To address this issue, we introduce Distributional RND (DRND), a derivative of RND. DRND enhances exploration by distilling a distribution of random networks and implicitly incorporating pseudo counts to improve the precision of bonus allocation. This refinement encourages agents to explore more extensively. Our method effectively mitigates the inconsistency issue without introducing significant computational overhead. Both theoretical analysis and experimental results demonstrate the superiority of our approach over the original RND algorithm. Our method excels in challenging online exploration scenarios and effectively serves as an anti-exploration mechanism in D4RL offline tasks.
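The intuition behind "distilling a distribution of random networks" and "implicitly incorporating pseudo counts" can be illustrated with a toy sketch. It is not the authors' implementation, and it rests on several assumptions that do not come from the abstract: the target networks are single random linear layers with a tanh, the predictor is replaced by its least-squares optimum (the mean of the targets it was regressed on), and all names and sizes (`N_TARGETS`, `distilled_prediction`, `bonus`) are hypothetical. Under these assumptions, the squared gap between the predictor and the mean target output shrinks roughly in proportion to 1/n for a state visited n times, which is the sense in which a count is encoded implicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; none of these values come from the paper.
N_TARGETS, STATE_DIM, OUT_DIM = 10, 4, 32

# A "distribution" of fixed, randomly initialised target networks
# (assumption: one random linear layer followed by tanh).
target_weights = rng.normal(size=(N_TARGETS, STATE_DIM, OUT_DIM))

def target_outputs(state):
    """Outputs of all target networks for one state, shape (N_TARGETS, OUT_DIM)."""
    return np.tanh(np.einsum("s,nso->no", state, target_weights))

def distilled_prediction(state, n_visits):
    """Stand-in for a trained predictor (assumption): on each visit it regresses
    on one uniformly sampled target, so its least-squares optimum is the mean
    of the sampled target outputs."""
    outs = target_outputs(state)
    picks = rng.integers(0, N_TARGETS, size=n_visits)
    return outs[picks].mean(axis=0)

def bonus(state, n_visits):
    """Squared distance between the predictor and the mean target output."""
    mu = target_outputs(state).mean(axis=0)
    return float(np.sum((distilled_prediction(state, n_visits) - mu) ** 2))

# Compare rarely visited and frequently visited states, averaged over 20 states.
states = rng.normal(size=(20, STATE_DIM))
rare = np.mean([bonus(s, 1) for s in states])
frequent = np.mean([bonus(s, 1000) for s in states])
print(f"mean bonus after 1 visit: {rare:.4f}, after 1000 visits: {frequent:.4f}")
```

In this sketch a state seen once receives a much larger bonus than a state seen a thousand times, without any explicit visit counter being stored. A gradient-trained neural predictor would only approximate this behaviour.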