Exploration remains a critical challenge in deep reinforcement learning: an agent must explore effectively to attain high returns in unknown environments. Although the prevailing exploration algorithm, Random Network Distillation (RND), has been shown to be effective in numerous environments, it often lacks discriminative power in bonus allocation. This paper identifies the ``bonus inconsistency'' issue within RND as its primary limitation. To address this issue, we introduce Distributional RND (DRND), a derivative of RND. DRND enhances the exploration process by distilling a distribution of random networks and implicitly incorporating pseudo-counts to improve the precision of bonus allocation. This refinement encourages agents to explore more extensively. Our method mitigates the inconsistency issue without introducing significant computational overhead. Both theoretical analysis and experimental results show that our approach outperforms the original RND algorithm. Our method excels in challenging online exploration scenarios and also serves as an effective anti-exploration mechanism in D4RL offline tasks.