Reinforcement learning (RL) is a powerful framework for decision-making in uncertain environments, but it often requires large amounts of data to learn an optimal policy. We address this challenge by incorporating prior model knowledge to guide exploration and accelerate learning. Specifically, we assume access to a model set that contains the true transition kernel and reward function. We optimize over this model set to obtain upper and lower bounds on the Q-function, which are then used to guide the agent's exploration. We provide theoretical guarantees on the convergence of the Q-function to the optimal Q-function under the proposed class of exploring policies. Furthermore, we introduce a data-driven, regularized version of the model-set optimization problem that ensures convergence of the exploring policies to the optimal policy. Lastly, we show that when the model set has a specific structure, namely that of a bounded-parameter MDP (BMDP), the regularized model-set optimization problem becomes convex and simple to implement. In this setting, we also prove finite-time convergence to the optimal policy under mild assumptions. We demonstrate the effectiveness of the proposed exploration strategy, which we call BUMEX (Bounded Uncertainty Model-based Exploration), in a simulation study. The results indicate that it can significantly accelerate learning in benchmark examples. A toolbox is available at https://github.com/JvHulst/BUMEX.
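To make the BMDP case concrete, the following is a minimal sketch, not the BUMEX toolbox API, of how upper and lower Q-bounds could be computed by interval Q-iteration over elementwise transition and reward bounds, together with one plausible optimism-based exploration rule. All array shapes, function names, and the exploration rule are illustrative assumptions.

```python
# A minimal sketch (assumptions, not the authors' implementation): interval
# Q-iteration on a bounded-parameter MDP with elementwise bounds
# P_lo, P_hi of shape (S, A, S) on transition probabilities and
# R_lo, R_hi of shape (S, A) on rewards.
import numpy as np

def extremal_expectation(p_lo, p_hi, v, maximize=True):
    """Pick p with p_lo <= p <= p_hi and sum(p) = 1 that extremizes p @ v,
    using the standard order-based construction for interval transition sets.
    Assumes sum(p_lo) <= 1 <= sum(p_hi)."""
    order = np.argsort(v)[::-1] if maximize else np.argsort(v)
    p = p_lo.copy()
    budget = 1.0 - p.sum()                 # probability mass left to assign
    for s in order:                        # favour high- (or low-) value states
        add = min(p_hi[s] - p_lo[s], budget)
        p[s] += add
        budget -= add
        if budget <= 1e-12:
            break
    return p @ v

def interval_q_bounds(P_lo, P_hi, R_lo, R_hi, gamma=0.95, iters=500):
    """Iterate optimistic and pessimistic Bellman backups over the model set
    to obtain upper and lower bounds on the optimal Q-function."""
    S, A, _ = P_lo.shape
    Q_hi = np.zeros((S, A))
    Q_lo = np.zeros((S, A))
    for _ in range(iters):
        V_hi, V_lo = Q_hi.max(axis=1), Q_lo.max(axis=1)
        for s in range(S):
            for a in range(A):
                Q_hi[s, a] = R_hi[s, a] + gamma * extremal_expectation(
                    P_lo[s, a], P_hi[s, a], V_hi, maximize=True)
                Q_lo[s, a] = R_lo[s, a] + gamma * extremal_expectation(
                    P_lo[s, a], P_hi[s, a], V_lo, maximize=False)
    return Q_lo, Q_hi

def explore_action(Q_lo, Q_hi, s):
    """One plausible exploration rule: act greedily w.r.t. the optimistic bound."""
    return int(np.argmax(Q_hi[s]))
```

The gap Q_hi - Q_lo quantifies the remaining model uncertainty per state-action pair; exploration rules that prefer actions with a large gap or a large optimistic value are one natural way such bounds could steer data collection.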