In real life, success is often contingent upon multiple critical steps that are distant in time from each other and from the final reward. These critical steps are challenging to identify with traditional reinforcement learning (RL) methods that rely on the Bellman equation for credit assignment. Here, we present a new RL algorithm that uses offline contrastive learning to home in on critical steps. This algorithm, which we call contrastive introspection (ConSpec), can be added to any existing RL algorithm. ConSpec learns a set of prototypes for the critical steps in a task using a novel contrastive loss, and it delivers an intrinsic reward when the current state matches one of these prototypes. The prototypes in ConSpec provide two key benefits for credit assignment: (1) they enable rapid identification of all the critical steps, and (2) they do so in a readily interpretable manner, enabling out-of-distribution generalization when sensory features are altered. Distinct from other contemporary RL approaches to credit assignment, ConSpec takes advantage of the fact that it is easier to retrospectively identify the small set of steps that success is contingent upon than it is to prospectively predict reward at every step taken in the environment. Altogether, ConSpec improves learning in a diverse set of RL tasks, including both those with explicit, discrete critical steps and those with complex, continuous critical steps.
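To make the prototype mechanism described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that states are already encoded as fixed-length vectors, that a "match" means cosine similarity above a threshold, and that the contrastive objective simply pushes each prototype's best match up on successful trajectories and down on failed ones. The function names, the `threshold` and `scale` parameters, and the exact form of the loss are all hypothetical choices made for illustration.

```python
import numpy as np


def cosine_similarity(states, prototypes):
    """Cosine similarity between T state vectors and K prototypes, shape (T, K)."""
    s = states / (np.linalg.norm(states, axis=1, keepdims=True) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    return s @ p.T


def intrinsic_reward(states, prototypes, threshold=0.6, scale=1.0):
    """Per-step intrinsic reward: summed similarity to every matched prototype.

    Assumption: a state "matches" a prototype when the cosine similarity
    exceeds `threshold`; both `threshold` and `scale` are illustrative.
    """
    sim = cosine_similarity(states, prototypes)
    matched = np.where(sim > threshold, sim, 0.0)
    return scale * matched.sum(axis=1)


def trajectory_scores(trajectory, prototypes):
    """Best match of each prototype anywhere in one trajectory, shape (K,)."""
    return cosine_similarity(trajectory, prototypes).max(axis=0)


def contrastive_loss(success_trajs, failure_trajs, prototypes):
    """Simplified stand-in for a contrastive objective (an assumption).

    Low when every prototype is matched somewhere in each successful
    trajectory and nowhere in the failed ones.
    """
    succ = np.stack([trajectory_scores(t, prototypes) for t in success_trajs])
    fail = np.stack([trajectory_scores(t, prototypes) for t in failure_trajs])
    return float(np.mean(1.0 - succ) + np.mean(np.clip(fail, 0.0, None)))


if __name__ == "__main__":
    prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])
    states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(intrinsic_reward(states, prototypes, threshold=0.9))
```

In this sketch the loss is computed retrospectively over whole stored trajectories, while the intrinsic reward is computed per step, which mirrors the abstract's contrast between identifying critical steps in hindsight and predicting reward at every step. In practice the prototypes and the state encoder would be trained by gradient descent on such a loss; that training loop is omitted here.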