Despite the large literature on reinforcement learning (RL) algorithms for partially observable Markov decision processes (POMDPs), a complete theoretical understanding is still lacking. In a partially observable setting, the history of data available to the agent grows over time, so most practical algorithms either truncate the history to a finite window or compress it using a recurrent neural network, which leads to an agent state that is non-Markovian. This paper shows that, despite the lack of the Markov property, recurrent Q-learning (RQL) converges in the tabular setting. Moreover, the quality of the converged limit depends on the quality of the representation, which is quantified in terms of an approximate information state (AIS). Based on this characterization of the approximation error, a variant of RQL with AIS losses is presented. This variant performs better than a strong RQL baseline that does not use AIS losses. It is further demonstrated that the performance of RQL over time correlates strongly with the loss associated with the AIS representation.
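To make the tabular setting concrete, the following is a minimal sketch of Q-learning run on a recursively updated agent state rather than on the hidden environment state. The abstract does not give the update rule, so this uses the standard Q-learning update as an assumption. The names `agent_state_update` and `rql_step`, the step size `alpha`, the discount `gamma`, and the choice of agent state (the most recent action-observation pair) are all hypothetical and chosen only for illustration.

```python
import numpy as np


def agent_state_update(z, obs, action, n_obs):
    """Hypothetical recursive update z' = f(z, obs, action).

    Assumption: the agent state keeps only the latest (action, observation)
    pair, encoded as a single integer. The previous state z is ignored here;
    a recurrent model would use it as well.
    """
    return action * n_obs + obs


def rql_step(Q, z, a, r, z_next, alpha=0.1, gamma=0.9):
    """One standard tabular Q-learning update indexed by the agent state.

    Q is an array of shape (n_agent_states, n_actions) and is modified
    in place. The agent state z need not be Markovian.
    """
    td_target = r + gamma * np.max(Q[z_next])
    Q[z, a] += alpha * (td_target - Q[z, a])
    return Q


# Usage: 2 observations and 2 actions give 4 possible agent states.
n_obs, n_actions = 2, 2
Q = np.zeros((n_obs * n_actions, n_actions))
z = 0
a, r, obs = 1, 1.0, 0  # illustrative transition, not real data
z_next = agent_state_update(z, obs, a, n_obs)
Q = rql_step(Q, z, a, r, z_next)
```

The Q-table is indexed by the agent state, so the same update applies whether that state comes from a finite window or from a learned recurrent representation that has been discretized.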