Exploration remains a critical challenge in reinforcement learning: many existing methods either lack theoretical guarantees or fall short in practice. In this paper, we introduce CAE, a lightweight algorithm that repurposes the value networks of standard deep RL algorithms to drive exploration without introducing additional parameters. CAE can plug in any linear multi-armed bandit technique and pairs it with an appropriate scaling strategy, enabling efficient exploration with provable sub-linear regret bounds and practical stability. Notably, it is simple to implement, requiring only around 10 lines of code. For complex tasks where learning an effective value network is difficult, we propose CAE+, an extension of CAE that adds an auxiliary network. This extension increases the parameter count by less than 1% while preserving implementation simplicity, requiring only about 10 additional lines of code. Experiments on MuJoCo and MiniHack show that both CAE and CAE+ outperform state-of-the-art baselines, bridging the gap between theoretical rigor and practical efficiency.
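To make the core idea concrete, the following is a minimal sketch of the kind of exploration bonus the abstract describes: a LinUCB-style elliptical bonus computed from features reused from a value network, scaled by a coefficient. All names here (`LinUCBBonus`, `beta`, `lam`) are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

class LinUCBBonus:
    """Hypothetical sketch: maintain A = lam*I + sum(phi phi^T) and return an
    exploration bonus beta * sqrt(phi^T A^{-1} phi), where phi would be a
    feature vector reused from the value network (an assumption here)."""

    def __init__(self, feat_dim, beta=1.0, lam=1.0):
        self.A_inv = np.eye(feat_dim) / lam  # inverse covariance matrix
        self.beta = beta                     # scaling coefficient (stand-in for the paper's scaling strategy)

    def bonus(self, phi):
        # Elliptical confidence width: large for rarely seen feature directions.
        return self.beta * float(np.sqrt(phi @ self.A_inv @ phi))

    def update(self, phi):
        # Rank-1 Sherman-Morrison update of A^{-1} after observing phi.
        Av = self.A_inv @ phi
        self.A_inv -= np.outer(Av, Av) / (1.0 + phi @ Av)
```

In use, the bonus would be added to the value estimate when selecting actions; because the update shrinks `A_inv` along visited feature directions, the bonus for a repeatedly visited state-action pair decays over time, which is what yields sub-linear regret in the linear bandit setting.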