Reinforcement learning (RL) is a powerful framework for decision-making in uncertain environments, but it often requires large amounts of data to learn an optimal policy. We address this challenge by incorporating prior model knowledge to guide exploration and accelerate learning. Specifically, we assume access to a model set that contains the true transition kernel and reward function. We optimize over this model set to obtain upper and lower bounds on the Q-function, which are then used to guide the agent's exploration. We provide theoretical guarantees on the convergence of the Q-function to the optimal Q-function under the proposed class of exploring policies. Furthermore, we introduce a data-driven regularized version of the model-set optimization problem that ensures convergence of the class of exploring policies to the optimal policy. Lastly, we show that when the model set has a specific structure, namely the bounded-parameter MDP (BMDP) framework, the regularized model-set optimization problem becomes convex and simple to implement. In this setting, we also prove finite-time convergence to the optimal policy under mild assumptions. We demonstrate the effectiveness of the proposed exploration strategy, which we call BUMEX (Bounded Uncertainty Model-based Exploration), in a simulation study. The results indicate that the proposed method can significantly accelerate learning in benchmark examples. A toolbox is available at https://github.com/JvHulst/BUMEX.
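To give a concrete sense of the Q-function bounds in the BMDP setting, the sketch below runs interval value iteration over a toy bounded-parameter MDP: transition probabilities are only known to lie in elementwise intervals [P_lo, P_hi], and optimistic/pessimistic completions of those intervals yield upper and lower Q-bounds, with the upper bound driving optimistic exploration. This is a minimal illustration under our own assumptions (known rewards, a hand-made two-state BMDP, greedy mass allocation for the inner optimization), not the paper's BUMEX implementation; all function names here are ours.

```python
import numpy as np

def order_maximizing_dist(p_lo, p_hi, order):
    """Distribution in the box [p_lo, p_hi] (summing to 1) that pushes
    as much probability mass as possible toward states early in `order`."""
    p = p_lo.copy()
    slack = 1.0 - p.sum()  # mass left to distribute above the lower bounds
    for s in order:
        bump = min(p_hi[s] - p_lo[s], slack)
        p[s] += bump
        slack -= bump
    return p

def interval_q_iteration(P_lo, P_hi, R, gamma=0.9, iters=200):
    """Upper/lower Q-bounds for a BMDP via interval value iteration.
    P_lo, P_hi: (S, A, S) transition-probability bounds; R: (S, A) rewards."""
    S, A, _ = P_lo.shape
    Q_hi = np.zeros((S, A))
    Q_lo = np.zeros((S, A))
    for _ in range(iters):
        V_hi = Q_hi.max(axis=1)
        V_lo = Q_lo.max(axis=1)
        order_hi = np.argsort(-V_hi)  # optimistic: favor high-value successors
        order_lo = np.argsort(V_lo)   # pessimistic: favor low-value successors
        for s in range(S):
            for a in range(A):
                p_opt = order_maximizing_dist(P_lo[s, a], P_hi[s, a], order_hi)
                p_pes = order_maximizing_dist(P_lo[s, a], P_hi[s, a], order_lo)
                Q_hi[s, a] = R[s, a] + gamma * p_opt @ V_hi
                Q_lo[s, a] = R[s, a] + gamma * p_pes @ V_lo
    return Q_hi, Q_lo

# Toy 2-state, 2-action BMDP: the true kernel lies inside [P_lo, P_hi].
P_lo = np.array([[[0.3, 0.5], [0.1, 0.7]],
                 [[0.6, 0.2], [0.4, 0.4]]])
P_hi = P_lo + 0.2
R = np.array([[1.0, 0.0], [0.5, 2.0]])

Q_hi, Q_lo = interval_q_iteration(P_lo, P_hi, R)
greedy_optimistic = Q_hi.argmax(axis=1)  # optimistic exploration policy
```

The inner maximization over the interval set is solved in closed form by the greedy allocation in `order_maximizing_dist`, which is what makes the BMDP structure cheap to exploit; any transition kernel inside the intervals then has its true Q-function sandwiched between `Q_lo` and `Q_hi`.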