In offline reinforcement learning, deriving an effective policy from a pre-collected set of experiences is challenging due to the distribution mismatch between the target policy and the behavioral policy used to collect the data, as well as the limited sample size. Model-based reinforcement learning improves sample efficiency by generating simulated experiences using a learned dynamics model of the environment. However, these synthetic experiences often suffer from the same distribution mismatch. To address these challenges, we introduce SimuDICE, a framework that iteratively refines the initial policy derived from offline data using synthetically generated experiences from the world model. SimuDICE enhances the quality of these simulated experiences by adjusting the sampling probabilities of state-action pairs based on stationary DIstribution Correction Estimation (DICE) and the estimated confidence in the model's predictions. This guides policy improvement by balancing experiences similar to those frequently encountered in the data with under-represented ones that exhibit distribution mismatch. Our experiments show that SimuDICE achieves performance comparable to existing algorithms while requiring fewer pre-collected experiences and planning steps, and it remains robust across varying data collection policies.
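To make the reweighting idea concrete, the following is a minimal, hypothetical sketch (not the paper's implementation): it assumes per-pair DICE correction ratios and world-model confidence scores are already available, and blends them into sampling probabilities via an illustrative trade-off parameter `alpha`. All names (`sampling_probabilities`, `dice_ratio`, `model_confidence`, `alpha`) are assumptions introduced for illustration.

```python
# Hypothetical sketch of blending DICE ratios with model confidence
# to obtain sampling probabilities over state-action pairs.
import numpy as np

def sampling_probabilities(dice_ratio: np.ndarray,
                           model_confidence: np.ndarray,
                           alpha: float = 0.5) -> np.ndarray:
    """Blend distribution-correction ratios with model confidence.

    dice_ratio[i]       -- estimated d^pi(s_i, a_i) / d^D(s_i, a_i)
    model_confidence[i] -- confidence of the world model in (s_i, a_i), in (0, 1]
    alpha               -- illustrative trade-off between correcting mismatch
                           and trusting the learned dynamics model
    """
    # Up-weight pairs the target policy visits more than the offline data suggests,
    # but damp pairs where the learned model is likely to be unreliable.
    scores = (dice_ratio ** alpha) * (model_confidence ** (1.0 - alpha))
    return scores / scores.sum()

# Usage example: sample a mini-batch of pairs for model rollouts / policy updates.
rng = np.random.default_rng(0)
ratios = rng.lognormal(mean=0.0, sigma=0.5, size=1000)
confidence = rng.uniform(0.2, 1.0, size=1000)
probs = sampling_probabilities(ratios, confidence)
batch_idx = rng.choice(len(probs), size=64, p=probs, replace=True)
```

The design choice in this sketch is that a high DICE ratio alone is not enough to up-weight a pair; the weight is tempered by how much the world model can be trusted there, which mirrors the balance between frequently encountered and mismatched experiences described above.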