Dropped into an unknown environment, what should an agent do to quickly learn about the environment and how to accomplish diverse tasks within it? We address this question within the goal-conditioned reinforcement learning paradigm by identifying how the agent should set its goals at training time to maximize exploration. We propose "Planning Exploratory Goals" (PEG), a method that sets goals for each training episode to directly optimize an intrinsic exploration reward. PEG first chooses goal commands such that the agent's goal-conditioned policy, at its current level of training, will end up in states with high exploration potential. It then launches an exploration policy starting at those promising states. To enable this direct optimization, PEG learns world models and adapts sampling-based planning algorithms to "plan goal commands". In challenging simulated robotics environments, including a multi-legged ant robot in a maze and a robot arm on a cluttered tabletop, PEG exploration enables more efficient and effective training of goal-conditioned policies relative to baselines and ablations. Our ant successfully navigates a long maze, and the robot arm successfully builds a stack of three blocks upon command. Website: https://penn-pal-lab.github.io/peg/
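The idea of "planning goal commands" can be illustrated with a small toy sketch. This is not the authors' implementation: the 2D point world, the `policy_rollout` stand-in for a learned world model and goal-conditioned policy, the nearest-neighbor novelty score in `exploration_value`, and the choice of the cross-entropy method as the sampling-based planner are all assumptions made for illustration. The sketch scores each candidate goal by the novelty of the state the current policy is predicted to reach, not by the novelty of the goal itself.

```python
import numpy as np


def policy_rollout(start, goal, skill=3.0):
    """Hypothetical stand-in for a world-model rollout of the goal-conditioned policy.

    The toy policy heads straight for the goal but can travel at most `skill`
    units, mimicking a partially trained policy that cannot yet reach far goals.
    Returns the predicted final state.
    """
    offset = goal - start
    dist = np.linalg.norm(offset)
    if dist < 1e-8:
        return start.copy()
    return start + offset / dist * min(dist, skill)


def exploration_value(state, visited):
    """Toy intrinsic reward: distance from `state` to the nearest visited state."""
    return float(np.min(np.linalg.norm(visited - state, axis=1)))


def plan_goal(start, visited, rng, n_samples=256, n_elite=16, n_iters=5):
    """Cross-entropy search over goal commands (an assumed planner choice).

    Each candidate goal is scored by the exploration value of the state the
    policy is predicted to end up in when commanded with that goal.
    """
    mean, std = np.zeros(2), np.full(2, 5.0)
    best_goal, best_score = start.copy(), -np.inf
    for _ in range(n_iters):
        goals = np.clip(rng.normal(mean, std, size=(n_samples, 2)), -10.0, 10.0)
        scores = np.array(
            [exploration_value(policy_rollout(start, g), visited) for g in goals]
        )
        top = int(np.argmax(scores))
        if scores[top] > best_score:
            best_goal, best_score = goals[top].copy(), float(scores[top])
        # Refit the sampling distribution to the highest-scoring goals.
        elite = goals[np.argsort(scores)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return best_goal


rng = np.random.default_rng(0)
start = np.zeros(2)
# Assume the agent has so far only visited the region to the left of the start.
visited = np.vstack([start, rng.uniform([-3.0, -3.0], [0.0, 3.0], size=(200, 2))])
goal = plan_goal(start, visited, rng)
reached = policy_rollout(start, goal)
print("planned goal:", goal, "predicted final state:", reached)
```

In this toy setup the planner should favor goals that pull the agent toward the unvisited side of the start state. In the method described above, an exploration policy would then be launched from the state the agent actually reaches; that second phase is omitted here.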