Model-based offline reinforcement learning (RL) methods have achieved state-of-the-art performance in many decision-making problems thanks to their sample efficiency and generalizability. Despite these advances, existing model-based offline RL approaches either focus on theoretical analysis without developing practical algorithms or rely on a restricted parametric policy space, and therefore do not fully exploit the unrestricted policy space that model-based methods make available. To address this limitation, we develop MoMA, a model-based mirror ascent algorithm with general function approximation under partial coverage of offline data. MoMA differs from existing work by employing an unrestricted policy class. In each iteration, the policy evaluation step conservatively estimates the value function by minimizing over a confidence set of transition models, and the policy improvement step updates the policy using general function approximation rather than a commonly used parametric policy class. Under mild assumptions, we establish theoretical guarantees for MoMA by proving an upper bound on the suboptimality of the returned policy. We also provide a practically implementable, approximate version of the algorithm. Numerical studies demonstrate the effectiveness of MoMA.
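As a rough illustration of the two steps described in the abstract, the following Python sketch runs pessimistic policy evaluation followed by a mirror ascent update on a tiny tabular MDP. It is not the MoMA algorithm itself. The confidence set is replaced by a hand-written list of two candidate transition models, the policy is a table rather than a general function approximator, and the mirror map is assumed to be negative entropy, which gives a multiplicative-weights update. All function names and the toy MDP are hypothetical.

```python
import math

def evaluate_q(P, R, pi, gamma, iters=200):
    """Iterative evaluation of Q^pi under one transition model.
    P[s][a] is a list of next-state probabilities, R[s][a] a reward."""
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        V = [sum(pi[s][a] * Q[s][a] for a in range(n_a)) for s in range(n_s)]
        Q = [[R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    return Q

def pessimistic_q(models, R, pi, mu0, gamma):
    """Return Q^pi under the candidate model that minimizes the
    policy's value from the initial distribution mu0."""
    best_v, best_q = None, None
    for P in models:
        Q = evaluate_q(P, R, pi, gamma)
        v = sum(mu0[s] * sum(pi[s][a] * Q[s][a] for a in range(len(pi[s])))
                for s in range(len(pi)))
        if best_v is None or v < best_v:
            best_v, best_q = v, Q
    return best_q

def mirror_ascent_step(pi, Q, eta):
    """Entropy-regularized mirror ascent (assumed mirror map):
    pi_new(a|s) is proportional to pi(a|s) * exp(eta * Q(s, a))."""
    new_pi = []
    for s in range(len(pi)):
        q_max = max(Q[s])  # subtract the max for numerical stability
        w = [p * math.exp(eta * (q - q_max)) for p, q in zip(pi[s], Q[s])]
        z = sum(w)
        new_pi.append([x / z for x in w])
    return new_pi

def pessimistic_mirror_ascent(models, R, mu0, gamma=0.9, eta=0.2, n_iters=30):
    """Alternate pessimistic evaluation and a mirror ascent update,
    starting from the uniform policy."""
    n_s, n_a = len(R), len(R[0])
    pi = [[1.0 / n_a] * n_a for _ in range(n_s)]
    for _ in range(n_iters):
        Q = pessimistic_q(models, R, pi, mu0, gamma)
        pi = mirror_ascent_step(pi, Q, eta)
    return pi

# Toy MDP (hypothetical): state 0 is the start, states 1 and 2 are absorbing.
# Action 0 ("safe") pays 0.5 and stays in state 0; action 1 ("risky") pays 0
# and leaves state 0. State 1 pays 1 per step, state 2 pays 0.
R = [[0.5, 0.0], [1.0, 1.0], [0.0, 0.0]]
stay1, stay2 = [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
# The two candidate models disagree only on where the risky action leads.
model_good = [[[1.0, 0.0, 0.0], stay1], [stay1, stay1], [stay2, stay2]]
model_bad = [[[1.0, 0.0, 0.0], stay2], [stay1, stay1], [stay2, stay2]]
mu0 = [1.0, 0.0, 0.0]

pi_pessimistic = pessimistic_mirror_ascent([model_good, model_bad], R, mu0)
pi_good_only = pessimistic_mirror_ascent([model_good], R, mu0)
print("P(safe | start), both models in the set:", pi_pessimistic[0][0])
print("P(safe | start), favorable model only:  ", pi_good_only[0][0])
```

In this toy problem the pessimistic evaluator scores the risky action under the unfavorable model, so the policy shifts toward the safe action. When the set contains only the favorable model, the same update prefers the risky action instead.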