The standard approach for Partially Observable Markov Decision Processes (POMDPs) is to convert them to a fully observed belief-state MDP. However, the belief state depends on the system model and is therefore not viable in reinforcement learning (RL) settings. A widely used alternative is an agent state: a model-free, recursively updatable function of the observation history. Examples include frame stacking and recurrent neural networks. Because the agent state is model-free, it can be used to adapt standard RL algorithms to POMDPs. However, standard RL algorithms such as Q-learning learn a stationary policy. Our main thesis, illustrated via examples, is that because the agent state does not satisfy the Markov property, non-stationary agent-state-based policies can outperform stationary ones. To leverage this feature, we propose PASQL (periodic agent-state-based Q-learning), a variant of agent-state-based Q-learning that learns periodic policies. By combining ideas from periodic Markov chains and stochastic approximation, we rigorously establish that PASQL converges to a cyclic limit and characterize the approximation error of the converged periodic policy. Finally, we present a numerical experiment to highlight the salient features of PASQL and demonstrate the benefit of learning periodic policies over stationary ones.
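To make the idea of a periodic policy concrete, the following is a minimal, hypothetical sketch of periodic agent-state-based Q-learning in the spirit of the abstract: one tabular Q-function per phase `t mod L`, with each phase bootstrapping from the next phase's table. The toy environment, period `L`, and all hyperparameters are illustrative assumptions, not the paper's actual algorithm or experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

L = 2            # period: one Q-table per phase t mod L (assumed for illustration)
n_states = 3     # agent states (e.g., the most recent observation)
n_actions = 2
Q = np.zeros((L, n_states, n_actions))

# Toy stand-in for the agent-state dynamics: random transition and reward
# tables. In a real POMDP these would be induced by the environment and the
# agent-state update rule, not known to the learner.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] -> dist over next s
R = rng.uniform(size=(n_states, n_actions))

alpha, gamma, eps = 0.1, 0.9, 0.1
s = 0
for t in range(20000):
    phase = t % L
    # Epsilon-greedy with respect to the current phase's Q-table.
    if rng.random() < eps:
        a = int(rng.integers(n_actions))
    else:
        a = int(np.argmax(Q[phase, s]))
    s_next = int(rng.choice(n_states, p=P[s, a]))
    r = R[s, a]
    # Periodic structure: bootstrap from the *next* phase's Q-table.
    target = r + gamma * np.max(Q[(phase + 1) % L, s_next])
    Q[phase, s, a] += alpha * (target - Q[phase, s, a])
    s = s_next

# The learned policy is periodic: pi_t(s) = argmax_a Q[t % L, s, a].
policy = Q.argmax(axis=2)
print(policy.shape)
```

A stationary variant corresponds to `L = 1`; with `L > 1` the greedy policy may differ across phases, which is exactly the extra freedom a periodic policy exploits when the agent state is non-Markovian.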