We study model-free reinforcement learning (RL) algorithms for episodic non-stationary constrained Markov decision processes (CMDPs), in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected cumulative utility (cost). In the non-stationary environment, the reward functions, utility functions, and transition kernels can vary arbitrarily over time, as long as their cumulative variations do not exceed certain variation budgets. We propose the first model-free, simulator-free RL algorithms for non-stationary CMDPs that provably achieve sublinear regret and zero constraint violation in both the tabular and linear function approximation settings. When the total variation budget is known, our regret and constraint-violation bounds for the tabular case match the corresponding best results for stationary CMDPs. Additionally, we present a general framework for addressing the well-known challenges of analyzing non-stationary CMDPs without requiring prior knowledge of the variation budget. We apply this framework to both the tabular and linear function approximation settings.
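The setting summarized above can be written down roughly as follows. This is only a sketch in assumed notation, not the paper's own definitions: $K$ episodes of horizon $H$, initial state $s_1$, value functions $V^{\pi}_{r,k,1}$ and $V^{\pi}_{g,k,1}$ for reward and utility in episode $k$, a constraint threshold $\rho$, and variation budgets $B_r$, $B_g$, $B_p$. The choice of norms and the comparator policy $\pi_k^*$ are assumptions and may differ from the paper.

```latex
\begin{align*}
% Per-episode constrained problem solved by the (assumed) comparator policy
\pi_k^* &\in \arg\max_{\pi} \; V^{\pi}_{r,k,1}(s_1)
  \quad \text{s.t.} \quad V^{\pi}_{g,k,1}(s_1) \ge \rho, \\
% Variation budgets: cumulative change between consecutive episodes
B_r &= \sum_{k=1}^{K-1} \sum_{h=1}^{H} \max_{s,a}
  \bigl| r_{k,h}(s,a) - r_{k+1,h}(s,a) \bigr|, \\
B_g &= \sum_{k=1}^{K-1} \sum_{h=1}^{H} \max_{s,a}
  \bigl| g_{k,h}(s,a) - g_{k+1,h}(s,a) \bigr|, \\
B_p &= \sum_{k=1}^{K-1} \sum_{h=1}^{H} \max_{s,a}
  \bigl\| P_{k,h}(\cdot \mid s,a) - P_{k+1,h}(\cdot \mid s,a) \bigr\|_1, \\
% Dynamic regret and cumulative constraint violation of the played policies
\mathrm{Regret}(K) &= \sum_{k=1}^{K}
  \Bigl( V^{\pi_k^*}_{r,k,1}(s_1) - V^{\pi_k}_{r,k,1}(s_1) \Bigr), \\
\mathrm{Violation}(K) &= \sum_{k=1}^{K}
  \Bigl( \rho - V^{\pi_k}_{g,k,1}(s_1) \Bigr).
\end{align*}
```

Under this reading, "sublinear regret" means $\mathrm{Regret}(K) = o(K)$, and "zero constraint violation" means $\mathrm{Violation}(K) \le 0$, typically once $K$ is sufficiently large.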