Reinforcement learning algorithms typically rely on the assumption that the environment dynamics and value function can be expressed in terms of a Markovian state representation. However, when state information is only partially observable, how can an agent learn such a state representation, and how can it detect when it has found one? We introduce a metric that can accomplish both objectives, without requiring access to, or knowledge of, an underlying, unobservable state space. Our metric, the $\lambda$-discrepancy, is the difference between two distinct temporal difference (TD) value estimates, each computed using TD($\lambda$) with a different value of $\lambda$. Since TD($\lambda = 0$) makes an implicit Markov assumption and TD($\lambda = 1$) does not, a discrepancy between these estimates is a potential indicator of a non-Markovian state representation. Indeed, we prove that the $\lambda$-discrepancy is exactly zero for all Markov decision processes and almost always non-zero for a broad class of partially observable environments. We also demonstrate empirically that, once partial observability has been detected, minimizing the $\lambda$-discrepancy can help the agent learn a memory function that mitigates it. We then train a reinforcement learning agent that simultaneously constructs two recurrent value networks with different $\lambda$ parameters and minimizes the difference between them as an auxiliary loss. The approach scales to challenging partially observable domains, where the resulting agent frequently performs significantly better (and never performs worse) than a baseline recurrent agent with only a single value network.
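To make the metric concrete, the following is a minimal sketch of how the $\lambda$-discrepancy could be computed on a single trajectory. It is an illustration under simplifying assumptions, not the paper's implementation: the function names are ours, `values[t]` is assumed to hold the bootstrap estimate for the state following step `t` (zero at termination), and the discrepancy is taken as the gap between the $\lambda = 0$ and $\lambda = 1$ return estimates at the initial observation.

```python
import numpy as np

def lambda_return(rewards, values, gamma, lam):
    """Compute lambda-returns via the backward recursion
    G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}).

    rewards[t]: reward received at step t
    values[t]:  bootstrap estimate V(s_{t+1}); 0.0 if s_{t+1} is terminal
    """
    T = len(rewards)
    G = np.zeros(T)
    g_next = 0.0  # return beyond the final step is zero
    for t in reversed(range(T)):
        g_next = rewards[t] + gamma * ((1 - lam) * values[t] + lam * g_next)
        G[t] = g_next
    return G

def lambda_discrepancy(rewards, values, gamma):
    """Gap between the TD(0)-style and Monte-Carlo-style value estimates
    at the start of the trajectory (illustrative definition)."""
    g0 = lambda_return(rewards, values, gamma, lam=0.0)  # one-step TD targets
    g1 = lambda_return(rewards, values, gamma, lam=1.0)  # Monte Carlo returns
    return abs(g0[0] - g1[0])

# Example: a 3-step trajectory whose learned values understate the true return,
# as can happen when observations alias distinct underlying states.
rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.5, 0.5, 0.0])  # last entry 0.0: terminal next state
disc = lambda_discrepancy(rewards, values, gamma=0.9)
```

With these numbers the $\lambda = 1$ return at $t = 0$ is $1 + 0.9(1 + 0.9 \cdot 1) = 2.71$, while the $\lambda = 0$ target is $1 + 0.9 \cdot 0.5 = 1.45$, giving a discrepancy of $1.26$; in the paper's setting a persistent gap of this kind, after expectations over trajectories, signals a non-Markovian representation.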