Exploration remains a fundamental challenge in reinforcement learning: many existing methods either lack theoretical guarantees or fall short in practical effectiveness. In this paper, we propose CAE (Critic as an Explorer), a lightweight approach that repurposes the value networks already present in standard deep RL algorithms to drive exploration, without introducing additional parameters. CAE combines multi-armed bandit techniques with a tailored scaling strategy, enabling efficient exploration with provable sub-linear regret bounds and strong empirical stability. It is simple to implement, requiring only about 10 lines of code. For complex tasks where learning reliable value networks is difficult, we introduce CAE+, an extension of CAE that incorporates an auxiliary network. CAE+ increases the parameter count by less than 1% while preserving implementation simplicity, adding roughly 10 more lines of code. Extensive experiments on MuJoCo, MiniHack, and Habitat validate the effectiveness of CAE and CAE+, demonstrating that they unite theoretical rigor with practical efficiency.
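The abstract does not spell out the algorithm, but to illustrate the general flavor of the idea, here is a minimal, hypothetical sketch of bandit-style exploration driven by an agent's existing critics. It assumes a TD3/SAC-style agent with twin Q-networks and uses their disagreement as an optimism bonus over a handful of candidate actions; the function name, the fixed `beta` scale, and the candidate-action scheme are illustrative assumptions, not the paper's actual method (which uses a tailored, adaptive scaling strategy).

```python
# Hypothetical sketch: UCB-style action selection reusing existing twin
# critics as the exploration signal, adding no new parameters.
import torch

def ucb_action(critic1, critic2, state, candidate_actions, beta=1.0):
    """Pick the candidate action with the highest optimistic value.

    critic1, critic2: the agent's existing Q-networks Q(s, a).
    state: tensor of shape (1, state_dim).
    candidate_actions: tensor of shape (k, action_dim), e.g. the policy
        action plus a few noisy perturbations of it.
    beta: exploration scale; a constant here for brevity, whereas an
        adaptive schedule would replace it in practice.
    """
    states = state.expand(candidate_actions.shape[0], -1)
    q1 = critic1(states, candidate_actions).squeeze(-1)
    q2 = critic2(states, candidate_actions).squeeze(-1)
    mean_q = 0.5 * (q1 + q2)      # value estimate
    bonus = (q1 - q2).abs()       # critic disagreement as uncertainty proxy
    return candidate_actions[(mean_q + beta * bonus).argmax()]
```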