The constrained Markov decision process (CMDP) framework is an important reinforcement learning approach for imposing safety or other critical objectives while maximizing cumulative reward. However, how to learn efficiently in a CMDP environment with a potentially infinite number of states remains poorly understood, particularly when function approximation is applied to the value functions. In this paper, we address the learning problem under linear function approximation with $q_{\pi}$-realizability, where the value functions of all policies are linearly representable with a known feature map, a setting known to be more general and challenging than other linear settings. Working in a local-access model, we propose a novel primal-dual algorithm that, after $\tilde{O}(\mathrm{poly}(d)\,\epsilon^{-3})$ queries, outputs with high probability a policy that strictly satisfies the constraints while nearly maximizing the value with respect to a reward function. Here, $d$ is the feature dimension and $\epsilon > 0$ is the target accuracy. The algorithm relies on a carefully crafted off-policy evaluation procedure that evaluates the policy from historical data; this informs policy updates through policy gradients while conserving samples. To our knowledge, this is the first result achieving polynomial sample complexity for CMDPs in the $q_{\pi}$-realizable setting.
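For concreteness, primal-dual CMDP methods are typically built on a Lagrangian saddle-point formulation; the following is a generic sketch of that formulation, not necessarily the paper's exact objective, and the symbols $\rho$ (initial-state distribution), $b$ (constraint threshold), and $V^{\pi}_{r}, V^{\pi}_{c}$ (reward and constraint value functions) are notational assumptions introduced here for illustration:
\[
\max_{\pi} \; \min_{\lambda \ge 0} \; L(\pi, \lambda) \;=\; V^{\pi}_{r}(\rho) \;+\; \lambda \bigl( V^{\pi}_{c}(\rho) - b \bigr),
\]
where the constraint is $V^{\pi}_{c}(\rho) \ge b$. The primal player improves $\pi$ via policy-gradient ascent on $L$, while the dual variable $\lambda$ is adjusted in the opposite direction according to the constraint violation $b - V^{\pi}_{c}(\rho)$; the off-policy evaluation procedure supplies the value estimates that drive both updates without fresh queries at every step.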