A fundamental assumption of reinforcement learning in Markov decision processes (MDPs) is that the relevant decision process is, in fact, Markov. However, when MDPs have rich observations, agents typically learn by way of an abstract state representation, and such representations are not guaranteed to preserve the Markov property. We introduce a novel set of conditions and prove that they are sufficient for learning a Markov abstract state representation. We then describe a practical training procedure that combines inverse model estimation and temporal contrastive learning to learn an abstraction that approximately satisfies these conditions. Our novel training objective is compatible with both online and offline training: it does not require a reward signal, but agents can capitalize on reward information when available. We empirically evaluate our approach on a visual gridworld domain and a set of continuous control benchmarks. Our approach learns representations that capture the underlying structure of the domain and lead to improved sample efficiency over state-of-the-art deep reinforcement learning with visual features, often matching or exceeding the performance achieved with hand-designed compact state information.
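To make the combination of inverse model estimation and temporal contrastive learning concrete, the following is a minimal sketch of how such a reward-free objective might be evaluated on a batch of transitions. It is not the authors' implementation: the linear encoder `encode`, the linear softmax inverse model `V`, the linear discriminator `u`, the weighting `alpha`, and the way negative pairs are drawn are all illustrative assumptions, and the sketch only computes the loss (no gradient updates).

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def encode(W, obs):
    """Hypothetical linear encoder phi: observation -> abstract state z."""
    return [dot(row, obs) for row in W]


def inverse_loss(V, z, z_next, action):
    """Cross-entropy of a linear softmax inverse model that predicts
    the action taken between abstract states z and z_next."""
    logits = [dot(row, z + z_next) for row in V]
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[action]


def pair_loss(u, z, z_other, is_real):
    """Binary cross-entropy of a linear discriminator that should score
    real (z, z') pairs high and mismatched pairs low."""
    s = dot(u, z + z_other)
    if not is_real:
        s = -s
    # -log(sigmoid(s)) written as a numerically stable softplus(-s)
    return max(-s, 0.0) + math.log1p(math.exp(-abs(s)))


def markov_loss(W, V, u, batch, rng, alpha=1.0):
    """Average of inverse-model loss plus alpha times contrastive loss.
    No reward appears anywhere in the objective."""
    total = 0.0
    for obs, action, next_obs in batch:
        z, z_next = encode(W, obs), encode(W, next_obs)
        # Assumption: negatives are next observations drawn at random from
        # the batch (this may occasionally pick the true successor).
        _, _, other_obs = rng.choice(batch)
        z_fake = encode(W, other_obs)
        inv = inverse_loss(V, z, z_next, action)
        con = 0.5 * (pair_loss(u, z, z_next, True)
                     + pair_loss(u, z, z_fake, False))
        total += inv + alpha * con
    return total / len(batch)


# Toy usage with synthetic transitions (obs, action, next_obs).
rng = random.Random(0)
obs_dim, z_dim, n_actions = 4, 2, 3
batch = [([rng.random() for _ in range(obs_dim)],
          rng.randrange(n_actions),
          [rng.random() for _ in range(obs_dim)]) for _ in range(8)]
W = [[0.0] * obs_dim for _ in range(z_dim)]
V = [[0.0] * (2 * z_dim) for _ in range(n_actions)]
u = [0.0] * (2 * z_dim)
loss = markov_loss(W, V, u, batch, rng)
```

With all weights at zero, both models are at chance, so the loss reduces to `log(n_actions) + alpha * log(2)`. Because the objective depends only on `(obs, action, next_obs)` tuples, the same function could in principle be applied to a fixed offline dataset or to transitions gathered online.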