We propose a novel generalization of constrained Markov decision processes (CMDPs) that we call the \emph{semi-infinitely constrained Markov decision process} (SICMDP). In particular, we consider a continuum of constraints rather than the finite number of constraints found in ordinary CMDPs. We also devise two reinforcement learning algorithms for SICMDPs, which we call SI-CRL and SI-CPO. SI-CRL is a model-based reinforcement learning algorithm. Given an estimate of the transition model, we first transform the reinforcement learning problem into a linear semi-infinite programming (LSIP) problem and then solve it with the dual exchange method from the LSIP literature. SI-CPO is a policy optimization algorithm. Borrowing ideas from the cooperative stochastic approximation approach, we make alternating updates to the policy parameters, either maximizing the reward or minimizing the cost. To the best of our knowledge, we are the first to apply tools from semi-infinite programming (SIP) to constrained reinforcement learning problems. We present theoretical analysis for SI-CRL and SI-CPO, identifying their iteration complexity and sample complexity. We also present extensive numerical examples that illustrate the SICMDP model and demonstrate that, combined with modern deep reinforcement learning techniques, our proposed algorithms can solve complex sequential decision-making tasks.
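
To make the phrase "a continuum of constraints" concrete, the problem can be sketched as follows. The notation is an assumption introduced here for illustration and does not appear in the abstract: $V_{r}^{\pi}$ denotes the value of policy $\pi$ under the reward $r$, $V_{c_y}^{\pi}$ its value under a cost function $c_y$ indexed by $y$, $u_y$ the corresponding threshold, and $Y$ the index set of constraints.

```latex
% Illustrative sketch only; all symbols (V, c_y, u_y, Y) are assumed notation.
\begin{equation*}
  \max_{\pi} \; V_{r}^{\pi}
  \quad \text{s.t.} \quad
  V_{c_y}^{\pi} \le u_y \quad \text{for all } y \in Y .
\end{equation*}
% If Y is a finite set, this reduces to an ordinary CMDP.
% If Y is a continuum (for example, a compact subset of R^d),
% there are infinitely many constraints on finitely many decision variables,
% which is the defining feature of a semi-infinite program.
```

Under this reading, a finite $Y$ recovers an ordinary CMDP, while an infinite $Y$ yields infinitely many constraints, which is what motivates the use of semi-infinite programming tools.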