Non-uniform goal selection has the potential to improve the reinforcement learning (RL) of skills over uniform-random selection. In this paper, we introduce a method for learning a goal-selection policy in intrinsically motivated goal-conditioned RL: "Diversity Progress" (DP). The learner forms a curriculum based on observed improvement in discriminability over its set of goals. Our proposed method is applicable to the class of discriminability-motivated agents, where the intrinsic reward is computed as a function of the agent's certainty that it is pursuing the true goal. This reward can motivate the agent to learn a set of diverse skills without extrinsic rewards. We demonstrate empirically that a DP-motivated agent can learn a set of distinguishable skills faster than previous approaches, and do so without suffering from a collapse of the goal distribution -- a known issue with some prior approaches. We conclude with plans to take this proof-of-concept forward.
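To make the notion of a discriminability-based intrinsic reward concrete, the sketch below shows one common instantiation (in the style of DIAYN-like methods, not necessarily the paper's exact formulation): the reward is the log-probability a learned discriminator assigns to the true goal given the current state, minus the log-prior of that goal. The function name, array shapes, and the uniform prior are illustrative assumptions.

```python
import numpy as np

def discriminability_reward(q_g_given_s: np.ndarray, goal: int, p_g: np.ndarray) -> float:
    """Illustrative discriminability-style intrinsic reward for one transition.

    q_g_given_s: discriminator's posterior over goals given the state, shape (n_goals,)
    goal:        index of the true goal being pursued
    p_g:         prior over goals, shape (n_goals,)

    The reward is high when the discriminator is certain the state was
    produced while pursuing the true goal, i.e. the skill is distinguishable.
    """
    return float(np.log(q_g_given_s[goal]) - np.log(p_g[goal]))

# Example: 4 goals with a uniform prior; the discriminator is fairly
# confident that goal 2 is the one being pursued.
q = np.array([0.05, 0.05, 0.85, 0.05])
prior = np.full(4, 0.25)
r = discriminability_reward(q, 2, prior)  # positive: state is informative about the goal
```

Under this formulation, a goal-selection curriculum such as DP can track how this reward improves per goal and sample goals whose discriminability is improving fastest, rather than sampling uniformly.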