The performance of offline reinforcement learning (RL) suffers from the limited size and quality of static datasets. Model-based offline RL addresses this issue by generating synthetic samples through a dynamics model to enhance overall performance. To evaluate the reliability of the generated samples, uncertainty estimation methods are often employed. However, model ensembles, the most commonly used uncertainty estimation method, are not always the best choice. In this paper, we propose a \textbf{S}earch-based \textbf{U}ncertainty estimation method for \textbf{M}odel-based \textbf{O}ffline RL (SUMO) as an alternative. SUMO characterizes the uncertainty of synthetic samples by measuring their cross entropy against the in-distribution dataset samples, and is implemented with an efficient search-based method. In this way, SUMO can achieve trustworthy uncertainty estimation. We integrate SUMO into several model-based offline RL algorithms, including MOPO and Adapted MOReL (AMOReL), and provide theoretical analysis for them. Extensive experimental results on D4RL datasets demonstrate that SUMO can provide more accurate uncertainty estimation and boost the performance of the base algorithms. These results indicate that SUMO could be a better uncertainty estimator for model-based offline RL when used for either reward penalty or trajectory truncation. Our code is available and will be open-sourced to support further research and development.
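To make the idea of a search-based uncertainty score concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a synthetic sample can be scored by the log distance to its k-th nearest neighbor in the offline dataset, which is one common nearest-neighbor surrogate for entropy-style quantities; the abstract does not specify the exact estimator. The function name `knn_uncertainty`, the parameters `k` and `eps`, and the brute-force NumPy search (standing in for whatever efficient search the method uses) are all hypothetical choices made for illustration.

```python
import numpy as np


def knn_uncertainty(synthetic, dataset, k=1, eps=1e-8):
    """Illustrative score: log distance from each synthetic sample to its
    k-th nearest dataset sample. This form is an assumption for the sketch,
    not the estimator defined in the paper."""
    synthetic = np.asarray(synthetic, dtype=np.float64)
    dataset = np.asarray(dataset, dtype=np.float64)
    # Pairwise Euclidean distances, shape (num_synthetic, num_dataset).
    # Brute force for clarity; a real system would use an indexed search.
    diffs = synthetic[:, None, :] - dataset[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Distance to the k-th nearest dataset sample (k is 1-indexed).
    kth = np.partition(dists, k - 1, axis=1)[:, k - 1]
    # eps keeps the log finite when a synthetic sample coincides with data.
    return np.log(kth + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical dataset of concatenated (state, action, next state) vectors.
    data = rng.normal(size=(200, 4))
    near = data[:5] + 0.01   # samples close to the dataset
    far = data[:5] + 10.0    # samples far from the dataset
    print("near:", knn_uncertainty(near, data))
    print("far: ", knn_uncertainty(far, data))
```

Under these assumptions, samples far from the dataset receive larger scores than samples close to it. Such a score could then serve as a reward penalty or as a threshold for truncating model rollouts, the two uses named in the abstract.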