Exploring unknown environments efficiently is a fundamental challenge in unsupervised goal-conditioned reinforcement learning. While selecting exploratory goals at the frontier of previously explored states is an effective strategy, the policy under training may still be unable to reliably reach rare frontier goals, which curtails exploratory behavior. We propose "Cluster Edge Exploration" ($CE^2$), a new goal-directed exploration algorithm that, when choosing goals in sparsely explored areas of the state space, prioritizes goal states that remain accessible to the agent. The key idea is to cluster states that are easily reachable from one another under the current policy in a latent space, and to direct the agent to states on the boundaries of these clusters that hold significant exploration potential before it begins exploratory behavior. In challenging robotics environments, including maze navigation with a multi-legged ant robot, object manipulation with a robot arm on a cluttered tabletop, and in-hand rotation of objects with an anthropomorphic robotic hand, $CE^2$ demonstrates superior exploration efficiency compared to baseline methods and ablations.
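The clustering-and-boundary idea described above can be sketched in a few lines. The following is a minimal illustration, not the paper's method: it assumes latent state embeddings are already available as vectors, uses a plain k-means grouping as a stand-in for the paper's reachability-based clustering, and uses a hypothetical score (distance to own centroid divided by local density) as a proxy for "boundary state with exploration potential."

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means in numpy; stand-in for the paper's latent clustering."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        d = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def cluster_edge_goal(latents, k=3, radius=0.5):
    """Pick a goal index near a cluster boundary in a sparse region.

    Hypothetical scoring rule (illustration only): prefer states far
    from their own cluster centroid (boundary-like) and with few
    neighbors within `radius` (sparsely explored).
    """
    centers, labels = kmeans(latents, k)
    dist_to_center = np.linalg.norm(latents - centers[labels], axis=1)
    pairwise = np.linalg.norm(latents[:, None] - latents[None], axis=-1)
    density = (pairwise < radius).sum(axis=1)  # includes the point itself
    score = dist_to_center / density
    return int(score.argmax())
```

In this sketch the chosen index would then serve as the exploratory goal that the goal-conditioned policy is commanded to reach before free exploration begins; the actual $CE^2$ criterion for "easily reachable from one another" and "exploration potential" is defined by the paper, not by this toy score.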