Recently, a simple yet effective algorithm, goal-conditioned supervised learning (GCSL), was proposed to tackle goal-conditioned reinforcement learning. GCSL is based on the principle of hindsight learning: by observing states visited in previously executed trajectories and treating them as attained goals, GCSL learns the corresponding actions via supervised learning. However, GCSL learns only a goal-conditioned policy, discarding other information in the process. Our insight is that the same hindsight principle can be used to learn to predict goal-conditioned sub-goals from the same trajectory. Based on this idea, we propose Trajectory Iterative Learner (TraIL), an extension of GCSL that further exploits the information in a trajectory and uses it to learn to predict both actions and sub-goals. We investigate the settings in which TraIL can make better use of the data, and find that for several popular problem settings, replacing real goals in GCSL with predicted TraIL sub-goals allows the agent to reach a greater set of goal states using exactly the same data as GCSL, thereby improving its overall performance.
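To make the hindsight principle concrete, the following is a minimal sketch of how a single trajectory could be relabeled into supervised training examples. It is an illustration under stated assumptions, not the implementation of GCSL or TraIL: the function name `hindsight_relabel`, the toy grid states, and in particular the choice of the temporal midpoint as the sub-goal label are all hypothetical, since the abstract does not specify how sub-goal targets are selected.

```python
from typing import List, Tuple

State = Tuple[int, int]   # toy 2-D grid position (assumption for illustration)
Action = str              # toy discrete action label (assumption for illustration)


def hindsight_relabel(
    states: List[State], actions: List[Action]
) -> Tuple[
    List[Tuple[State, State, Action]],
    List[Tuple[State, State, State]],
]:
    """Relabel one trajectory s_0, a_0, s_1, ..., a_{T-1}, s_T in hindsight.

    Returns two supervised datasets:
      * action_data:  (state, goal, action) tuples, in the spirit of GCSL;
      * subgoal_data: (state, goal, sub_goal) tuples, in the spirit of TraIL.
    """
    if len(states) != len(actions) + 1:
        raise ValueError("expected len(states) == len(actions) + 1")

    action_data = []
    subgoal_data = []
    for t in range(len(actions)):
        # Every state visited after step t is treated as a goal that was attained.
        for h in range(t + 1, len(states)):
            goal = states[h]
            # The action taken at step t is the label for reaching `goal` from s_t.
            action_data.append((states[t], goal, actions[t]))
            # ASSUMPTION: the sub-goal label is the state at the temporal midpoint
            # between t and h. Only emitted when an intermediate state exists.
            if h - t >= 2:
                mid = (t + h) // 2
                subgoal_data.append((states[t], goal, states[mid]))
    return action_data, subgoal_data


# Usage on a toy trajectory: two steps right, then one step up.
states = [(0, 0), (1, 0), (2, 0), (2, 1)]
actions = ["right", "right", "up"]
action_data, subgoal_data = hindsight_relabel(states, actions)
```

Both datasets are built from exactly the same trajectory: the first supports learning a goal-conditioned policy, and the second supports learning a goal-conditioned sub-goal predictor, whose output could then stand in for the real goal when querying the policy.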