Visual reinforcement learning (RL) has shown promise in continuous control tasks. Despite this progress, current algorithms remain unsatisfactory in virtually every aspect of performance, including sample efficiency, asymptotic performance, and robustness to the choice of random seed. In this paper, we identify a major shortcoming of existing visual RL methods: agents often exhibit sustained inactivity during early training, which limits their ability to explore effectively. Building on this observation, we further reveal a significant correlation between an agent's tendency toward motorically inactive exploration and the absence of neuronal activity within its policy network. To quantify this inactivity, we adopt the dormant ratio as a metric of inactivity in the RL agent's network. Empirically, we also find that the dormant ratio can serve as a standalone indicator of an agent's activity level, regardless of the reward signals it receives. Leveraging these insights, we introduce DrM, a method that uses three core mechanisms to guide the agent's exploration-exploitation trade-off by actively minimizing the dormant ratio. Experiments demonstrate that DrM achieves significant improvements in sample efficiency and asymptotic performance, with no broken seeds (76 seeds in total), across three continuous control benchmark environments: DeepMind Control Suite, MetaWorld, and Adroit. Most importantly, DrM is the first model-free algorithm that consistently solves tasks in both the Dog and Manipulator domains of the DeepMind Control Suite, as well as three dexterous hand manipulation tasks in Adroit without demonstrations, all from pixel observations.
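As a rough illustration of the dormant ratio, the sketch below assumes the definition commonly used in the dormant-neuron literature: a neuron's score is its mean absolute activation over a batch, normalized by the average of that quantity across its layer, and the neuron counts as dormant when the score is at most a threshold tau. The abstract does not state the formula, so the function name `dormant_ratio`, the default value of `tau`, and the array layout are assumptions made here for illustration, not the authors' implementation.

```python
import numpy as np


def dormant_ratio(layer_activations, tau=0.025):
    """Fraction of tau-dormant neurons across a list of layers.

    layer_activations: list of arrays, each of shape (batch, num_neurons),
        holding one layer's post-activation outputs for a batch of inputs.
    tau: dormancy threshold (hypothetical default chosen for illustration).
    """
    dormant, total = 0, 0
    for acts in layer_activations:
        acts = np.asarray(acts, dtype=np.float64)
        # Mean absolute activation of each neuron over the batch.
        per_neuron = np.abs(acts).mean(axis=0)
        # Normalize by the layer-wide average so scores are scale-free.
        layer_mean = per_neuron.mean()
        if layer_mean > 0:
            scores = per_neuron / layer_mean
        else:
            # An all-zero layer: every neuron is treated as dormant.
            scores = np.zeros_like(per_neuron)
        dormant += int((scores <= tau).sum())
        total += per_neuron.size
    return dormant / total if total else 0.0


# Toy layer: six active neurons and two that never fire.
rng = np.random.default_rng(0)
active = np.abs(rng.normal(size=(32, 6)))
dead = np.zeros((32, 2))
layer = np.concatenate([active, dead], axis=1)
print(dormant_ratio([layer]))  # 2 of 8 neurons are dormant -> 0.25
```

Because the score depends only on activations, it can be logged during training without reference to rewards, which is consistent with the abstract's claim that the dormant ratio works as a standalone indicator of activity.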