A promising technique for exploration is to maximize the entropy of the visited state distribution, i.e., state entropy, by encouraging uniform coverage of the visited state space. While this has been effective in an unsupervised setup, it tends to struggle in a supervised setup with a task reward, where an agent prefers to visit high-value states to exploit the task reward. Such a preference can cause an imbalance between the distributions of high-value and low-value states, which biases exploration towards low-value state regions because state entropy increases as the distribution becomes more uniform. This issue is exacerbated when high-value states are narrowly distributed within the state space, making it difficult for the agent to complete its tasks. In this paper, we present a novel exploration technique that maximizes the value-conditional state entropy, which separately estimates the state entropies conditioned on the value estimates of each state and then maximizes their average. By considering only the visited states with similar value estimates when computing the intrinsic bonus, our method prevents the distribution of low-value states from affecting exploration around high-value states, and vice versa. We demonstrate that the proposed alternative to the state entropy baseline significantly accelerates various reinforcement learning algorithms across a variety of tasks within the MiniGrid, DeepMind Control Suite, and Meta-World benchmarks. Source code is available at https://sites.google.com/view/rl-vcse.
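To make the idea concrete, the following is a minimal sketch of a value-conditional intrinsic bonus, not the paper's exact estimator. For illustration it approximates value conditioning by partitioning visited states into quantile bins of their value estimates and computing a standard k-nearest-neighbor state-entropy bonus only within each bin; the function names (`knn_state_entropy_bonus`, `value_conditional_bonus`), the binning scheme, and all hyperparameters are hypothetical choices for this sketch.

```python
import numpy as np

def knn_state_entropy_bonus(states, k=5):
    """k-NN state-entropy bonus: log distance to each state's k-th
    nearest neighbor among the given states (up to constants)."""
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    # Row-wise sort puts the distance to the point itself (0) at index 0,
    # so index k is the distance to the k-th nearest neighbor.
    knn_dist = np.sort(dists, axis=1)[:, k]
    return np.log(knn_dist + 1e-8)

def value_conditional_bonus(states, values, k=5, num_bins=4):
    """Binning approximation of a value-conditional bonus: compute the
    k-NN entropy bonus only among states whose value estimates fall in
    the same quantile bin."""
    bonus = np.zeros(len(states))
    # Partition visited states into quantile bins of their value estimates.
    edges = np.quantile(values, np.linspace(0.0, 1.0, num_bins + 1))
    bin_ids = np.clip(np.digitize(values, edges[1:-1]), 0, num_bins - 1)
    for b in range(num_bins):
        idx = np.where(bin_ids == b)[0]
        if len(idx) > k:  # need enough same-value neighbors in the bin
            bonus[idx] = knn_state_entropy_bonus(states[idx], k=k)
    return bonus

# Toy usage: states near a "goal" at (1, 1) get high value estimates, so
# their bonus is measured only against other high-value states.
rng = np.random.default_rng(0)
states = rng.normal(size=(256, 2))
values = -np.linalg.norm(states - np.array([1.0, 1.0]), axis=1)
print(value_conditional_bonus(states, values).mean())
```

The toy usage illustrates the intended effect described in the abstract: the exploration bonus of a high-value state is computed only against other states with similar value estimates, so a large mass of low-value states cannot dilute exploration around high-value regions, and vice versa.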