We consider the problem of tabular infinite-horizon concave utility reinforcement learning (CURL) with convex constraints. For this setting, we propose a model-based learning algorithm that also achieves zero constraint violations. Assuming that the problem with the concave objective and the convex constraints has a solution in the interior of the set of feasible occupation measures, we solve a tighter optimization problem to ensure that the constraints are never violated despite imprecise model knowledge and model stochasticity. We use a Bellman error-based analysis for tabular infinite-horizon setups, which allows analyzing stochastic policies. Combining the Bellman error-based analysis with the tighter optimization problem, for $T$ interactions with the environment, we obtain a high-probability regret guarantee for the objective that grows as $\tilde{O}(1/\sqrt{T})$, excluding other factors. The proposed method can be applied to optimistic algorithms to obtain high-probability regret bounds, and it can also be used with posterior sampling algorithms to obtain a loose Bayesian regret bound, but with a significant improvement in computational complexity.
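As an illustrative sketch of the tightening step, under assumed notation that does not appear in the abstract itself: let $\Lambda$ denote the set of occupation measures consistent with the model, $f$ the concave objective, $g$ the convex constraint function (written so that feasibility means $g(\lambda) \le 0$), and $\epsilon > 0$ a hypothetical tightening margin. The tightened problem could then be written as

$$
\max_{\lambda \in \Lambda} \; f(\lambda) \quad \text{subject to} \quad g(\lambda) \le -\epsilon ,
$$

in contrast to the original constraint $g(\lambda) \le 0$. The interior-solution assumption plays the role of a Slater-type condition: if some $\bar{\lambda} \in \Lambda$ satisfies $g(\bar{\lambda}) < 0$, then the tightened problem remains feasible for any $\epsilon \le -g(\bar{\lambda})$. A solution of the tightened problem keeps a slack of $\epsilon$, which is what can absorb errors from the estimated model and from stochastic transitions. How the margin is chosen and how it enters the regret bound is not specified here, so this should be read as a sketch of the idea rather than the exact formulation.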