Passive observational data, such as human videos, is abundant and rich in information, yet remains largely untapped by current RL methods. Perhaps surprisingly, we show that passive data, despite not having reward or action labels, can still be used to learn features that accelerate downstream RL. Our approach learns from passive data by modeling intentions: measuring how the likelihood of future outcomes changes when the agent acts to achieve a particular task. We propose a temporal difference learning objective to learn about intentions, resulting in an algorithm similar to conventional RL, but which learns entirely from passive data. When optimizing this objective, our agent simultaneously learns representations of states, of policies, and of possible outcomes in an environment, all from raw observational data. Both theoretically and empirically, this scheme learns features amenable to value prediction for downstream tasks, and our experiments demonstrate the ability to learn from many forms of passive data, including cross-embodiment video data and YouTube videos.
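To make concrete the idea of temporal difference learning from data that has neither actions nor rewards, the following is a minimal tabular sketch. It is not the objective proposed in this work: it omits the intention conditioning and the learned representations entirely, and only illustrates how a TD update can estimate the discounted likelihood of future outcomes from state-only trajectories, in the style of a successor representation. The function name `td_outcome_model`, its parameters, and the toy chain environment are all assumptions made for illustration.

```python
import numpy as np


def td_outcome_model(trajectories, n_states, gamma=0.9, lr=0.5, epochs=200):
    """Tabular TD(0) estimate of M[s, g], the expected discounted number of
    future visits to outcome g when starting in state s.

    Hypothetical helper for illustration: it learns from state-only
    trajectories, using neither actions nor rewards.
    """
    M = np.zeros((n_states, n_states))
    eye = np.eye(n_states)
    for _ in range(epochs):
        for traj in trajectories:
            for s, s_next in zip(traj[:-1], traj[1:]):
                # TD target: the outcome reached at the next step, plus the
                # discounted prediction of outcomes reached after it.
                target = eye[s_next] + gamma * M[s_next]
                M[s] += lr * (target - M[s])
    return M


# Toy passive dataset (an assumption): one observed pass along a 4-state chain.
passive_data = [[0, 1, 2, 3]]
M = td_outcome_model(passive_data, n_states=4)
```

On this deterministic chain, `M[0, 3]` approaches `gamma ** 2 = 0.81`, since state 3 is reached two steps after the first transition out of state 0. An intention-conditioned variant would additionally index the table by the task the agent is assumed to be pursuing, which this sketch does not attempt.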