Motivated by the novel paradigm developed by Van Roy and coauthors for reinforcement learning in arbitrary non-Markovian environments, we propose a related formulation and explicitly characterize the error caused by the non-Markovianity of observations when the Q-learning algorithm is applied to this formulation. Based on this characterization, we propose that the criterion for agent design should be to seek good approximations of certain conditional laws. Inspired by classical stochastic control, we show that our problem reduces to the recursive computation of approximate sufficient statistics. This leads to an autoencoder-based scheme for agent design, which we test numerically on partially observed reinforcement learning environments.