In real life, success is often contingent upon multiple critical steps that are distant in time from each other and from the final reward. These critical steps are challenging to identify with traditional reinforcement learning (RL) methods that rely on the Bellman equation for credit assignment. Here, we present a new RL algorithm that uses offline contrastive learning to home in on these critical steps. This algorithm, which we call Contrastive Retrospection (ConSpec), can be added to any existing RL algorithm. ConSpec learns a set of prototypes for the critical steps in a task using a novel contrastive loss and delivers an intrinsic reward when the current state matches one of the prototypes. The prototypes in ConSpec provide two key benefits for credit assignment: (i) they enable rapid identification of all the critical steps, and (ii) they do so in a readily interpretable manner, enabling out-of-distribution generalization when sensory features are altered. Distinct from other contemporary RL approaches to credit assignment, ConSpec takes advantage of the fact that it is easier to retrospectively identify the small set of steps that success is contingent upon (while ignoring other states) than it is to prospectively predict reward at every step taken. ConSpec greatly improves learning in a diverse set of RL tasks.
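To make the prototype-matching idea concrete, the following is a minimal sketch of how an intrinsic reward could be computed from a set of prototypes. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how a "match" is measured, so cosine similarity with a fixed threshold is assumed here, and the names `intrinsic_reward`, `shaped_rewards`, `threshold`, and `scale` are hypothetical. The contrastive loss that learns the prototypes is not shown.

```python
import numpy as np


def cosine_similarity(states, prototypes):
    """Pairwise cosine similarity: states (T, D) and prototypes (K, D) -> (T, K)."""
    s = states / (np.linalg.norm(states, axis=1, keepdims=True) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    return s @ p.T


def intrinsic_reward(states, prototypes, threshold=0.6, scale=1.0):
    """Per-step intrinsic reward for states that match any prototype.

    Assumption: a state "matches" a prototype when their cosine similarity
    exceeds `threshold`; the similarity itself is used as the reward.
    """
    sim = cosine_similarity(states, prototypes)       # shape (T, K)
    matched = np.where(sim > threshold, sim, 0.0)     # zero out non-matches
    return scale * matched.sum(axis=1)                # shape (T,)


def shaped_rewards(env_rewards, states, prototypes, **kwargs):
    """Add the intrinsic reward to the environment reward at every step."""
    env_rewards = np.asarray(env_rewards, dtype=float)
    return env_rewards + intrinsic_reward(states, prototypes, **kwargs)


if __name__ == "__main__":
    # Two hypothetical prototypes in a 2-D embedding space.
    prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])
    # Three encoded states from one trajectory; only the last step is rewarded
    # by the environment.
    states = np.array([[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]])
    print(shaped_rewards([0.0, 0.0, 1.0], states, prototypes))
```

Because the shaped reward is simply added to the environment reward, a sketch like this can sit in front of any existing RL update, which is consistent with the claim that ConSpec can be added to an existing algorithm.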