Current approaches to model-based offline reinforcement learning often incorporate uncertainty-based reward penalization to address the distributional shift problem. These approaches, commonly known as pessimistic value iteration, use Monte Carlo sampling to estimate the Bellman target for temporal-difference-based policy evaluation. We find that the randomness introduced by this sampling step significantly delays convergence. We present a theoretical result demonstrating the strong dependence of suboptimality on the number of Monte Carlo samples taken per Bellman target calculation. Our main contribution is a deterministic approximation of the Bellman target based on progressive moment matching, a method originally developed for deterministic variational inference. The resulting algorithm, which we call Moment Matching Offline Model-Based Policy Optimization (MOMBO), propagates the uncertainty of the next state through a nonlinear Q-network in a deterministic fashion by approximating the distribution of each hidden layer's activations with a normal distribution. We show that it is possible to provide tighter suboptimality guarantees for MOMBO than for existing Monte Carlo sampling approaches. We also observe that MOMBO converges faster than these approaches across a large set of benchmark tasks.
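To make the moment-matching idea concrete, the following is a minimal sketch, not the authors' implementation: a Gaussian belief over the next state, such as one produced by a learned dynamics model, is propagated through a small ReLU network using closed-form Gaussian moments, so a pessimistic (lower-confidence-bound) target can be computed without Monte Carlo sampling. All names, layer sizes, and the pessimism weight `beta` are illustrative; for brevity the sketch conditions the network on the state alone, uses a diagonal-covariance approximation at every layer, and omits the reward and discount terms of the full Bellman target.

```python
import math
import torch

def linear_moments(mean, var, weight, bias):
    # z = W x + b with x ~ N(mean, diag(var)):
    # E[z] = W mean + b, Var[z] ≈ (W ∘ W) var  (diagonal approximation).
    return mean @ weight.T + bias, var @ (weight ** 2).T

def relu_moments(mean, var, eps=1e-8):
    # Closed-form first two moments of ReLU(z) for z ~ N(mean, var), elementwise.
    std = torch.sqrt(var.clamp_min(eps))
    alpha = mean / std
    cdf = 0.5 * (1.0 + torch.erf(alpha / math.sqrt(2.0)))          # Phi(alpha)
    pdf = torch.exp(-0.5 * alpha ** 2) / math.sqrt(2.0 * math.pi)  # phi(alpha)
    out_mean = mean * cdf + std * pdf
    out_sq = (mean ** 2 + var) * cdf + mean * std * pdf            # E[ReLU(z)^2]
    out_var = (out_sq - out_mean ** 2).clamp_min(0.0)
    return out_mean, out_var

# Example: propagate a Gaussian next-state belief through a 2-layer MLP.
torch.manual_seed(0)
state_dim, hidden, batch = 11, 64, 32
w1 = torch.randn(hidden, state_dim) / state_dim ** 0.5
b1 = torch.zeros(hidden)
w2 = torch.randn(1, hidden) / hidden ** 0.5
b2 = torch.zeros(1)

mu = torch.randn(batch, state_dim)         # dynamics-model mean of next state
var = 0.1 * torch.ones(batch, state_dim)   # dynamics-model variance of next state

m, v = linear_moments(mu, var, w1, b1)
m, v = relu_moments(m, v)
m, v = linear_moments(m, v, w2, b2)

beta = 1.0
pessimistic_target = m - beta * v.sqrt()   # closed-form lower-confidence target
```

Because the output mean and variance are obtained deterministically in a single forward pass, the penalized target `m - beta * v.sqrt()` replaces an average over many sampled next states, which is the source of the sampling noise the abstract identifies.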