Representations are at the core of all deep reinforcement learning (RL) methods for both Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). Many representation learning methods and theoretical frameworks have been developed to understand what constitutes an effective representation. However, the relationships between these methods and the properties they share remain unclear. In this paper, we show that many of these seemingly distinct methods and frameworks for state and history abstractions are, in fact, based on a common idea of self-predictive abstraction. Furthermore, we provide theoretical insights into widely adopted objectives and optimization techniques, such as stop-gradient, for learning self-predictive representations. Together, these findings yield a minimalist algorithm to learn self-predictive representations for states and histories. We validate our theories by applying our algorithm to standard MDPs, MDPs with distractors, and POMDPs with sparse rewards. Our results culminate in a set of practical guidelines for RL practitioners.