Reinforcement learning algorithms typically rely on the assumption that the environment dynamics and value function can be expressed in terms of a Markovian state representation. However, when state information is only partially observable, how can an agent learn such a state representation, and how can it detect when it has found one? We introduce a metric that can accomplish both objectives, without requiring access to -- or knowledge of -- an underlying, unobservable state space. Our metric, the $\lambda$-discrepancy, is the difference between two distinct temporal difference (TD) value estimates, each computed using TD($\lambda$) with a different value of $\lambda$. Since TD($\lambda{=}0$) makes an implicit Markov assumption and TD($\lambda{=}1$) does not, a discrepancy between these estimates is a potential indicator of a non-Markovian state representation. Indeed, we prove that the $\lambda$-discrepancy is exactly zero for all Markov decision processes and almost always non-zero for a broad class of partially observable environments. We also demonstrate empirically that, once detected, minimizing the $\lambda$-discrepancy can help with learning a memory function to mitigate the corresponding partial observability. We then train a reinforcement learning agent that simultaneously constructs two recurrent value networks with different $\lambda$ parameters and minimizes the difference between them as an auxiliary loss. The approach scales to challenging partially observable domains, where the resulting agent frequently performs significantly better (and never performs worse) than a baseline recurrent agent with only a single value network.
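The core idea can be illustrated on a tiny, hypothetical aliased environment (the environment and names below are our own illustration, not from the paper's experiments). Two equally likely episodes pass through a shared observation Y that aliases two distinct underlying states. TD($\lambda{=}1$) reduces to Monte Carlo return averaging, while TD($\lambda{=}0$) bootstraps one step at a time over observations; under aliasing their fixed points disagree, and that gap is the $\lambda$-discrepancy:

```python
# Toy aliased environment (hypothetical, for illustration only).
# Path 1: obs X -> obs Y -> terminal, reward 1 on the final step.
# Path 2: obs Z -> obs Y -> terminal, reward 0 on the final step.
# Observation Y aliases two different underlying states.
episodes = [(["X", "Y"], 1.0), (["Z", "Y"], 0.0)]  # (obs sequence, terminal reward)
gamma = 1.0

def mean(xs):
    return sum(xs) / len(xs)

# TD(lambda=1), i.e. Monte Carlo: value of an observation is the
# average return observed after visiting it.
returns = {}
for obs_seq, r in episodes:
    for o in obs_seq:
        returns.setdefault(o, []).append(r)
v_mc = {o: mean(rs) for o, rs in returns.items()}

# TD(lambda=0): fixed point of one-step bootstrapping over observations,
# V(o) <- E[r + gamma * V(o')], with the expectation over episode visits.
v_td0 = {o: 0.0 for o in v_mc}
for _ in range(100):
    targets = {o: [] for o in v_td0}
    for obs_seq, r in episodes:
        for i, o in enumerate(obs_seq):
            if i + 1 < len(obs_seq):
                targets[o].append(gamma * v_td0[obs_seq[i + 1]])  # mid-episode reward is 0
            else:
                targets[o].append(r)  # transition into the terminal state
    v_td0 = {o: mean(ts) for o, ts in targets.items()}

# The lambda-discrepancy: zero at Markovian observations, nonzero where
# bootstrapping through the aliased observation Y distorts the estimate.
discrepancy = {o: abs(v_mc[o] - v_td0[o]) for o in v_mc}
print(v_mc)         # {'X': 1.0, 'Y': 0.5, 'Z': 0.0}
print(v_td0)        # {'X': 0.5, 'Y': 0.5, 'Z': 0.5}
print(discrepancy)  # {'X': 0.5, 'Y': 0.0, 'Z': 0.5}
```

Here the Monte Carlo estimate correctly assigns X a value of 1, but TD(0) bootstraps through the aliased value $V(Y)=0.5$, yielding 0.5; the resulting discrepancy at X and Z flags the non-Markovian representation, exactly as in the paper's characterization.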