Despite the considerable attention given to the questions of \textit{how much} and \textit{how to} explore in deep reinforcement learning, the question of \textit{when} to explore remains comparatively underexplored. While sophisticated exploration strategies can excel in specific, often sparse-reward environments, simpler approaches such as $\epsilon$-greedy continue to outperform them across a broader spectrum of domains. The appeal of these simpler strategies lies in their ease of implementation and their generality; their downside is that they are essentially blind switching mechanisms that completely disregard the agent's internal state. In this paper, we propose to leverage the agent's internal state to decide \textit{when} to explore, addressing the shortcomings of blind switching mechanisms. We present Value Discrepancy and State Counts through homeostasis (VDSC), a novel approach for efficient exploration timing. Experimental results on the Atari suite demonstrate the superiority of our strategy over traditional methods such as $\epsilon$-greedy and Boltzmann exploration, as well as more sophisticated techniques such as Noisy Nets.