Reinforcement learning algorithms often require the state and action spaces of a Markov decision process (MDP, also called a controlled Markov chain) to be finite, and the literature contains various efforts to extend such algorithms to continuous state and action spaces. In this paper, we show that under very mild regularity conditions (in particular, requiring only weak continuity of the transition kernel of the MDP), Q-learning for standard Borel MDPs via quantization of states and actions (called Quantized Q-Learning) converges to a limit. Furthermore, this limit satisfies an optimality equation that yields near-optimal policies, either with explicit performance bounds or with guaranteed asymptotic optimality. Our approach builds on (i) viewing quantization as a measurement kernel, and thus a quantized MDP as a partially observed Markov decision process (POMDP); (ii) utilizing near-optimality and convergence results of Q-learning for POMDPs; and (iii) utilizing near-optimality of finite-state model approximations for MDPs with weakly continuous kernels, which we show correspond to the fixed point of the constructed POMDP. Thus, our paper presents a very general convergence and approximation result for the applicability of Q-learning to continuous MDPs.
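To make the idea of Quantized Q-Learning concrete, the following is a minimal sketch of tabular Q-learning run on quantized states and a finite action grid. It is an illustration only and not the paper's exact algorithm: the uniform quantizer, the toy dynamics and cost in `step`, the uniformly random exploration policy, and the visit-count step size are all assumptions introduced here.

```python
import random


def quantize(x, low, high, n_bins):
    """Map a point of [low, high] to the index of its uniform bin."""
    idx = int((x - low) / (high - low) * n_bins)
    return min(max(idx, 0), n_bins - 1)


def step(x, a, rng):
    """Hypothetical toy dynamics on the state space [0, 1] (an assumption)."""
    cost = (x - 0.5) ** 2 + 0.01 * a ** 2          # stage cost to be minimized
    x_next = x + 0.1 * a + rng.uniform(-0.02, 0.02)  # noisy controlled drift
    return min(max(x_next, 0.0), 1.0), cost


def quantized_q_learning(n_bins=5, actions=(-1.0, 0.0, 1.0),
                         beta=0.9, n_steps=60000, seed=0):
    """Tabular Q-learning where the learner only sees the bin of the state."""
    rng = random.Random(seed)
    q = [[0.0] * len(actions) for _ in range(n_bins)]
    visits = [[0] * len(actions) for _ in range(n_bins)]
    x = rng.random()  # true continuous state, hidden from the learner
    for _ in range(n_steps):
        s = quantize(x, 0.0, 1.0, n_bins)
        a_idx = rng.randrange(len(actions))  # uniform exploration (assumption)
        x_next, cost = step(x, actions[a_idx], rng)
        s_next = quantize(x_next, 0.0, 1.0, n_bins)
        # Step size decays with the number of visits to the (bin, action) pair.
        alpha = 1.0 / (1 + visits[s][a_idx])
        visits[s][a_idx] += 1
        # Cost-minimizing Q-learning update on the quantized state.
        target = cost + beta * min(q[s_next])
        q[s][a_idx] += alpha * (target - q[s][a_idx])
        x = x_next
    return q


q_table = quantized_q_learning()
# Greedy (cost-minimizing) action index for each state bin.
greedy = [row.index(min(row)) for row in q_table]
print(greedy)
```

The learner never observes the continuous state `x` directly, only its bin index, which mirrors the view of quantization as a measurement channel that turns the quantized MDP into a POMDP. Refining `n_bins` and the action grid is the knob that the approximation results concern.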