While many sophisticated exploration methods have been proposed, their lack of generality and high computational cost often lead researchers to favor simpler methods like $\epsilon$-greedy. Motivated by this, we introduce $\beta$-DQN, a simple and efficient exploration method that augments the standard DQN with a behavior function $\beta$. This function estimates the probability that each action has been taken at each state. By leveraging $\beta$, we generate a population of diverse policies that balance exploration between state-action coverage and overestimation bias correction. An adaptive meta-controller is designed to select an effective policy for each episode, enabling flexible and explainable exploration. $\beta$-DQN is straightforward to implement and adds minimal computational overhead to the standard DQN. Experiments on both simple and challenging exploration domains show that $\beta$-DQN outperforms existing baseline methods across a wide range of tasks, providing an effective solution for improving exploration in deep reinforcement learning.
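The mechanism described above can be illustrated with a minimal sketch. All names, the linear blending rule, and the bandit-style meta-controller here are assumptions for illustration only, not the paper's exact formulation: a behavior function `beta` estimates how often each action has been taken in each state, a family of policies interpolates between exploiting `Q` and seeking rarely-taken actions, and a simple selector picks a policy per episode.

```python
import numpy as np

# Illustrative sketch of the beta-DQN idea; the blending rule and
# meta-controller below are simplifying assumptions, not the paper's method.

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3

# Q(s, a): learned action values (random placeholders here).
Q = rng.normal(size=(n_states, n_actions))

# beta(s, a): estimated probability that action a has been taken in state s.
beta = rng.dirichlet(np.ones(n_actions), size=n_states)

def make_policy(lam):
    """Blend value-greedy behavior with a coverage bonus.

    lam = 0 -> act greedily on Q (pure exploitation);
    lam = 1 -> prefer rarely-taken actions, i.e. low beta (pure coverage).
    """
    def act(s):
        coverage_bonus = -beta[s]  # rarely-taken actions score higher
        score = (1.0 - lam) * Q[s] + lam * coverage_bonus
        return int(np.argmax(score))
    return act

# A small population of policies spanning exploitation to exploration.
policies = [make_policy(lam) for lam in np.linspace(0.0, 1.0, 5)]

# Toy meta-controller: epsilon-greedy bandit over observed episode returns.
returns = np.zeros(len(policies))
counts = np.zeros(len(policies))

def select_policy(eps=0.2):
    """Pick a policy index for the next episode."""
    if rng.random() < eps or counts.min() == 0:
        return int(rng.integers(len(policies)))
    return int(np.argmax(returns / np.maximum(counts, 1)))

def record_return(idx, episode_return):
    """Feed an episode's return back to the meta-controller."""
    returns[idx] += episode_return
    counts[idx] += 1
```

In this sketch the extreme policies recover familiar behaviors: `lam = 0` is the standard greedy DQN policy, while `lam = 1` ignores values entirely and chases under-visited actions, which is the kind of state-action coverage the abstract refers to.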