Many real-world decision-making tasks, such as safety-critical scenarios, cannot be fully described in a single-objective setting using the Markov Decision Process (MDP) framework, because they include hard constraints. These tasks can instead be modeled with additional cost functions within the Constrained Markov Decision Process (CMDP) framework. Even though CMDPs have been extensively studied in the reinforcement learning literature, little attention has been given to sampling-based planning algorithms such as Monte Carlo Tree Search (MCTS) for solving them. Previous approaches use Monte Carlo cost estimates to avoid constraint violations. However, these estimates suffer from high variance, which results in conservative performance with respect to costs. We propose Constrained MCTS (C-MCTS), an algorithm that estimates cost using a safety critic. The safety critic is trained with Temporal Difference learning in an offline phase prior to agent deployment. During deployment, the critic limits exploration of the search tree and prunes unsafe trajectories within MCTS. C-MCTS satisfies cost constraints but operates closer to the constraint boundary, achieving higher rewards than previous work. As a byproduct, the planner is more efficient, requiring fewer planning steps. Most importantly, we show that under model mismatch between the planner and the real world, our approach is less susceptible to cost violations than previous work.
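To make the role of the safety critic concrete, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes a tabular critic trained with a SARSA-style TD(0) update on logged transitions, and a simple filter that a tree search could apply when expanding a node. The function names `train_cost_critic` and `prune_unsafe`, the toy transitions, and the hyperparameters are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def train_cost_critic(transitions, gamma=0.95, alpha=0.1, epochs=200):
    """Estimate the discounted cost Q_c(s, a) with tabular TD(0).

    transitions: list of (state, action, cost, next_state, next_action, done),
    collected offline before deployment (an assumption of this sketch).
    """
    q_cost = defaultdict(float)
    for _ in range(epochs):
        for s, a, c, s2, a2, done in transitions:
            # Bootstrap from the next state-action pair unless the episode ended.
            target = c if done else c + gamma * q_cost[(s2, a2)]
            q_cost[(s, a)] += alpha * (target - q_cost[(s, a)])
    return q_cost


def prune_unsafe(state, actions, q_cost, budget):
    """Keep only the actions whose predicted cost stays within the budget."""
    return [a for a in actions if q_cost[(state, a)] <= budget]


# Toy problem (hypothetical): "left" ends the episode at no cost,
# "right" leads to state 1, where "go" incurs a cost of 1 and ends the episode.
TRANSITIONS = [
    (0, "left", 0.0, None, None, True),
    (0, "right", 0.0, 1, "go", False),
    (1, "go", 1.0, None, None, True),
]

q_cost = train_cost_critic(TRANSITIONS)
print(prune_unsafe(0, ["left", "right"], q_cost, budget=0.5))  # ['left']
```

In a planner, a filter of this kind would be consulted at node expansion so that branches predicted to exceed the cost budget are never explored. Because the critic bootstraps from its own estimates instead of averaging full rollouts, its predictions avoid the rollout variance that the abstract attributes to Monte Carlo cost estimates.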