We propose a novel approach to the problem of controller design for environments modeled as Markov decision processes (MDPs). Specifically, we consider a hierarchical MDP: a graph in which each vertex is populated by an MDP called a "room". We first apply deep reinforcement learning (DRL) to obtain low-level policies for each room, scaling to large rooms of unknown structure. We then apply reactive synthesis to obtain a high-level planner that chooses which low-level policy to execute in each room. The central challenge in synthesizing the planner is the need to model the rooms. We address this challenge by developing a DRL procedure that trains concise "latent" policies together with PAC guarantees on their performance. Unlike previous approaches, ours circumvents a model distillation step. Our approach combats sparse rewards in DRL and enables reuse of low-level policies. We demonstrate feasibility in a case study involving agent navigation amid moving obstacles.
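To make the two-level structure concrete, the following is a minimal sketch of how a high-level planner could pick one low-level policy per room. It is not the reactive synthesis procedure of the paper: it swaps in plain value iteration over the room graph, and it assumes that each low-level policy either leaves its room through the intended exit with probability at least `success[(room, direction)]` or fails permanently. The names `plan`, `edges`, `success`, and the example rooms and probabilities are hypothetical and chosen only for illustration; in practice the probabilities would stand in for the PAC lower bounds on latent policy performance.

```python
from typing import Dict, Tuple

Room = str
Direction = str


def plan(
    edges: Dict[Tuple[Room, Direction], Room],
    success: Dict[Tuple[Room, Direction], float],
    target: Room,
    sweeps: int = 100,
) -> Tuple[Dict[Room, float], Dict[Room, Direction]]:
    """Choose, per room, the exit direction maximizing a lower bound on
    the probability of reaching `target` (assumption: failure is absorbing)."""
    rooms = {room for room, _ in edges} | set(edges.values()) | {target}
    value = {room: 0.0 for room in rooms}
    value[target] = 1.0  # the target room is reached with certainty
    choice: Dict[Room, Direction] = {}

    for _ in range(sweeps):
        changed = False
        for (room, direction), next_room in edges.items():
            if room == target:
                continue
            # Probability of exiting as intended, times the value of the next room.
            candidate = success[(room, direction)] * value[next_room]
            if candidate > value[room] + 1e-12:
                value[room] = candidate
                choice[room] = direction
                changed = True
        if not changed:  # fixed point reached
            break
    return value, choice


# Hypothetical 2x2 grid of rooms: A and B on top, C and D below.
edges = {
    ("A", "east"): "B",
    ("A", "south"): "C",
    ("B", "south"): "D",
    ("C", "east"): "D",
}
# Hypothetical lower bounds on each low-level policy's success probability.
success = {
    ("A", "east"): 0.9,
    ("A", "south"): 0.99,
    ("B", "south"): 0.9,
    ("C", "east"): 0.7,
}
value, choice = plan(edges, success, target="D")
print(choice["A"], round(value["A"], 3))
```

In this toy instance the planner routes from room A through room B, because the product of the bounds along that route (0.9 × 0.9) exceeds the product along the route through C (0.99 × 0.7), even though the first step toward C is individually more reliable. This illustrates why the planner needs a quantitative summary of each room's policies rather than a full model of each room.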