Q-Learning for MDPs with General Spaces: Convergence and Near Optimality via Quantization under Weak Continuity

Reinforcement learning algorithms often require the state and action spaces of a Markov decision process (MDP, also called a controlled Markov chain) to be finite, and various efforts have been made in the literature to extend such algorithms to continuous state and action spaces. In this paper, we show that under very mild regularity conditions (in particular, involving only weak continuity of the transition kernel of an MDP), Q-learning for standard Borel MDPs via quantization of states and actions (called Quantized Q-Learning) converges to a limit. Furthermore, this limit satisfies an optimality equation that yields near optimality, either with explicit performance bounds or with guaranteed asymptotic optimality. Our approach builds on (i) viewing quantization as a measurement kernel, and thus a quantized MDP as a partially observed Markov decision process (POMDP); (ii) utilizing near-optimality and convergence results of Q-learning for POMDPs; and (iii) near optimality of finite-state model approximations for MDPs with weakly continuous kernels, which we show to correspond to the fixed point of the constructed POMDP. Thus, our paper presents a very general convergence and approximation result for the applicability of Q-learning to continuous MDPs.
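The idea of running tabular Q-learning on quantized states and actions can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy dynamics and cost in `step`, the uniform binning in `quantize`, the uniformly random exploration policy, and the step size of one over the visit count. The learner only ever sees the bin index of the true continuous state, which is the sense in which quantization acts like a measurement channel.

```python
import numpy as np


def quantize(x, n_bins):
    """Map a state x in [0, 1] to the index of its uniform bin."""
    return min(int(x * n_bins), n_bins - 1)


def step(x, u, rng):
    """Hypothetical toy dynamics on [0, 1] with additive Gaussian noise."""
    x_next = np.clip(x + 0.1 * u + 0.05 * rng.standard_normal(), 0.0, 1.0)
    cost = (x - 0.5) ** 2 + 0.01 * u ** 2  # illustrative stage cost
    return float(x_next), float(cost)


def quantized_q_learning(n_bins=10, actions=(-1.0, 0.0, 1.0),
                         beta=0.9, n_steps=50_000, seed=0):
    """Tabular Q-learning (cost minimization) on quantized observations."""
    rng = np.random.default_rng(seed)
    q = np.zeros((n_bins, len(actions)))
    visits = np.zeros_like(q)
    x = rng.uniform()  # true continuous state, hidden from the learner
    for _ in range(n_steps):
        i = quantize(x, n_bins)           # learner observes only the bin
        a = rng.integers(len(actions))    # uniformly random exploration
        x_next, cost = step(x, actions[a], rng)
        j = quantize(x_next, n_bins)
        visits[i, a] += 1
        alpha = 1.0 / visits[i, a]        # decaying per-pair step size
        target = cost + beta * q[j].min()
        q[i, a] += alpha * (target - q[i, a])
        x = x_next
    return q


if __name__ == "__main__":
    q = quantized_q_learning()
    greedy = q.argmin(axis=1)  # greedy action index for each state bin
    print(greedy)
```

The greedy policy read off the learned table is piecewise constant over the bins. Refining `n_bins` and the action grid is the knob that the approximation results in the abstract are concerned with; this sketch makes no claim about how fast the approximation improves.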

