Constrained Markov decision processes (CMDPs), in which the agent optimizes expected payoffs while keeping the expected cost below a given threshold, are the leading framework for safe sequential decision making under stochastic uncertainty. Among algorithms for planning and learning in CMDPs, methods based on Monte Carlo tree search (MCTS) are particularly important due to their efficiency and their extensibility to more complex frameworks (such as partially observable settings and games). However, current MCTS-based methods for CMDPs either struggle to find safe (i.e., constraint-satisfying) policies or are too conservative and fail to find valuable policies. We introduce Threshold UCT (T-UCT), an online MCTS-based algorithm for CMDP planning. Unlike previous MCTS-based CMDP planners, T-UCT explicitly estimates Pareto curves of cost-utility trade-offs throughout the search tree and uses them, together with novel action-selection and threshold-update rules, to seek safe and valuable policies. Our experiments demonstrate that our approach significantly outperforms state-of-the-art methods from the literature.
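To make the core idea concrete, the following is a minimal illustrative sketch, not the paper's T-UCT algorithm: each action is associated with an estimated Pareto curve of (cost, utility) points, dominated points are pruned, and the action offering the highest utility at a cost within the remaining threshold is selected. The function names (`pareto_filter`, `select_action`) and the data layout are hypothetical choices for this example.

```python
from typing import Dict, List, Optional, Tuple

def pareto_filter(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Keep only non-dominated (cost, utility) points: a point is dropped if
    another point achieves at least as much utility at no greater cost."""
    front = []
    best_utility = float("-inf")
    for cost, utility in sorted(points):  # ascending cost
        if utility > best_utility:
            front.append((cost, utility))
            best_utility = utility
    return front

def select_action(
    curves: Dict[str, List[Tuple[float, float]]],
    threshold: float,
) -> Optional[Tuple[str, float, float]]:
    """Among all Pareto-optimal points whose cost fits within the remaining
    threshold, pick the one with the highest utility.
    Returns (action, cost, utility), or None if no point is feasible."""
    best = None
    for action, points in curves.items():
        for cost, utility in pareto_filter(points):
            if cost <= threshold and (best is None or utility > best[2]):
                best = (action, cost, utility)
    return best

# Example: two actions, each with an estimated cost-utility trade-off curve.
curves = {
    "a": [(0.2, 1.0), (0.8, 3.0)],
    "b": [(0.5, 2.0), (1.5, 5.0)],
}
print(select_action(curves, threshold=1.0))  # -> ("a", 0.8, 3.0)
```

In a full planner, the chosen point's expected cost would then drive a threshold update for the subtree (the remaining budget shrinks as cost is incurred), which is where the trade-off between safety and value is actually negotiated.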