A central task in control theory, artificial intelligence, and formal methods is to synthesize reward-maximizing strategies for agents that operate in partially unknown environments. In environments modeled by gray-box Markov decision processes (MDPs), the impact of the agents' actions is known in terms of successor states but not in terms of the stochastics involved. In this paper, we devise a strategy synthesis algorithm for gray-box MDPs via reinforcement learning that uses interval MDPs as an internal model. To cope with limited sampling access in reinforcement learning, we incorporate two novel concepts into our algorithm, focusing on rapid and successful learning rather than on stochastic guarantees and optimality: lower confidence bound exploration reinforces variants of already learned practical strategies, and action scoping reduces the learning action space to promising actions. We illustrate the benefits of our algorithms by means of a prototypical implementation applied to examples from the AI and formal methods communities.
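To make the two concepts concrete, the following is a minimal sketch of one plausible reading of them for a single state, not the algorithm of the paper itself. It assumes that transition intervals are derived from sample counts via a Hoeffding-style bound, that successor values are already given, that action scoping drops every action whose upper value bound lies below the best lower bound, and that lower confidence bound exploration picks the action with the highest lower bound. All function names and the parameter `delta` are hypothetical and chosen for illustration only.

```python
import math


def hoeffding_interval(count, total, delta=0.1):
    """Interval around the empirical frequency count/total, clipped to [0, 1].

    Assumption: a Hoeffding-style half-width; the paper may use another bound.
    """
    if total == 0:
        return (0.0, 1.0)  # nothing sampled yet: probability is fully unknown
    p_hat = count / total
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * total))
    return (max(0.0, p_hat - eps), min(1.0, p_hat + eps))


def bound_expectation(intervals, values, pessimistic):
    """Smallest (or largest) expected value over distributions within the intervals."""
    # Start from the lower bounds, then hand the remaining probability mass
    # to the lowest-valued successors first (pessimistic) or the highest-valued
    # ones first (optimistic), never exceeding an upper bound.
    order = sorted(intervals, key=lambda s: values[s], reverse=not pessimistic)
    probs = {s: intervals[s][0] for s in intervals}
    remaining = 1.0 - sum(probs.values())
    for s in order:
        extra = min(intervals[s][1] - probs[s], max(remaining, 0.0))
        probs[s] += extra
        remaining -= extra
    return sum(probs[s] * values[s] for s in intervals)


def action_bounds(counts, values, delta=0.1):
    """Map each action to (lower, upper) value bounds.

    counts: {action: {successor: number of observations}} for one state.
    The successor sets are known up front, as in a gray-box MDP.
    """
    bounds = {}
    for action, succ_counts in counts.items():
        total = sum(succ_counts.values())
        intervals = {s: hoeffding_interval(c, total, delta)
                     for s, c in succ_counts.items()}
        bounds[action] = (bound_expectation(intervals, values, True),
                          bound_expectation(intervals, values, False))
    return bounds


def scope_actions(bounds):
    """Keep only actions whose upper bound reaches the best lower bound."""
    best_lower = max(lower for lower, _ in bounds.values())
    return {a for a, (_, upper) in bounds.items() if upper >= best_lower}


def lcb_action(bounds, scope):
    """Pick the scoped action with the highest lower confidence bound."""
    return max(scope, key=lambda a: bounds[a][0])


if __name__ == "__main__":
    values = {"goal": 1.0, "trap": 0.0}
    counts = {
        "safe": {"goal": 90, "trap": 10},
        "risky": {"goal": 10, "trap": 90},
        "new": {"goal": 0, "trap": 0},
    }
    bounds = action_bounds(counts, values)
    scope = scope_actions(bounds)
    print(bounds, scope, lcb_action(bounds, scope))
```

In this toy setting, an action that has been observed to perform poorly is removed from the scope, an unsampled action stays in scope because its upper bound is still high, and the lower-bound rule favors the action that has already proven itself. This mirrors the stated focus on rapid learning over optimality guarantees.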