A significant bottleneck in applying current reinforcement learning algorithms to real-world scenarios is the need to reset the environment between every episode. This reset process demands substantial human intervention, making it difficult for the agent to learn continuously and autonomously. Several recent works have introduced autonomous reinforcement learning (ARL) algorithms that generate curricula for jointly training reset and forward policies. While their curricula can reduce the number of required manual resets by taking into account the agent's learning progress, they rely on task-specific knowledge, such as predefined initial states or reset reward functions. In this paper, we propose a novel ARL algorithm that can generate a curriculum adaptive to the agent's learning progress without task-specific knowledge. Our curriculum empowers the agent to autonomously reset to diverse and informative initial states. To achieve this, we introduce a success discriminator that estimates the success probability from each initial state when the agent follows the forward policy. The success discriminator is trained with relabeled transitions in a self-supervised manner. Our experimental results demonstrate that our ARL algorithm can generate an adaptive curriculum and enable the agent to efficiently bootstrap to solve sparse-reward maze navigation and manipulation tasks, outperforming baselines with significantly fewer manual resets.
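The success discriminator described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the paper does not specify the model class, so this sketch uses a simple logistic-regression classifier (`SuccessDiscriminator`, a hypothetical name) that maps an initial state to the probability that the forward policy succeeds from it, trained with binary cross-entropy on relabeled (initial state, success) pairs.

```python
import numpy as np

# Illustrative sketch only: the paper's discriminator architecture and
# training details are not specified here; this uses a toy logistic
# regression model as a stand-in.

class SuccessDiscriminator:
    """Estimates P(success | initial state) under the forward policy."""

    def __init__(self, state_dim, lr=0.1):
        self.w = np.zeros(state_dim)
        self.b = 0.0
        self.lr = lr

    def predict(self, states):
        # Success probability for each initial state (sigmoid of linear score).
        z = states @ self.w + self.b
        return 1.0 / (1.0 + np.exp(-z))

    def update(self, states, labels):
        # One full-batch gradient step of binary cross-entropy.
        # labels[i] = 1 if the forward policy's rollout from states[i]
        # reached the goal (obtained by relabeling past transitions).
        p = self.predict(states)
        grad = p - labels
        self.w -= self.lr * states.T @ grad / len(labels)
        self.b -= self.lr * grad.mean()

# Usage: relabel stored rollouts as success/failure, then fit the
# discriminator in a self-supervised loop. The data below is synthetic.
rng = np.random.default_rng(0)
states = rng.normal(size=(256, 4))
labels = (states[:, 0] > 0).astype(float)  # toy "success" signal
disc = SuccessDiscriminator(state_dim=4)
for _ in range(500):
    disc.update(states, labels)
probs = disc.predict(states)
accuracy = float(((probs > 0.5) == labels).mean())
```

In the curriculum, such predicted probabilities could then be used to pick reset states that are neither trivially easy (probability near 1) nor hopeless (near 0), which is one natural reading of "diverse and informative initial states."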