Offline reinforcement learning (RL) suffers from distributional shift and unreliable value estimation, especially for out-of-distribution (OOD) actions. Existing uncertainty-based methods address this by penalizing the value function according to an uncertainty estimate, but they require large ensembles of networks, which is computationally expensive and can lead to suboptimal results. In this paper, we introduce a novel strategy that uses diverse randomized value functions to estimate the posterior distribution of $Q$-values. This yields robust uncertainty quantification and estimates of the lower confidence bounds (LCB) of $Q$-values. By applying moderate value penalties to OOD actions, our method is provably pessimistic. We also emphasize diversity among the randomized value functions and introduce a diversity regularization method that reduces the number of networks required, improving efficiency. Together, these modules lead to reliable value estimation and efficient policy learning from offline data. Theoretical analysis shows that our method recovers the provably efficient LCB penalty under linear MDP assumptions. Extensive empirical results demonstrate that our method significantly outperforms baseline methods in both performance and parameter efficiency.
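As a rough illustration of how a lower confidence bound can be read off a set of randomized value functions, the sketch below treats each ensemble member's $Q$-estimate as a posterior sample and subtracts a multiple of the sample standard deviation from the sample mean. The abstract does not state the exact form of the penalty, so the mean-minus-scaled-deviation rule, the function name `lcb_from_ensemble`, the coefficient `beta`, and the toy numbers are all assumptions made for illustration only.

```python
import numpy as np


def lcb_from_ensemble(q_samples: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Hypothetical helper: LCB of Q-values from ensemble samples.

    q_samples has shape (n_members, n_actions); each row is one randomized
    value function's Q-estimate for the same state. The bound is
    mean - beta * std across members (an assumed, commonly used form).
    """
    mean = q_samples.mean(axis=0)
    std = q_samples.std(axis=0)  # population std (ddof=0) over members
    return mean - beta * std


# Toy Q-estimates from three ensemble members for two actions.
# Column 0: an in-distribution action, where members nearly agree.
# Column 1: an OOD action, where members disagree strongly.
q = np.array([
    [5.0, 3.0],
    [5.1, 7.0],
    [4.9, 5.0],
])

lcb = lcb_from_ensemble(q, beta=1.0)
print(lcb)
```

Both actions have the same mean estimate in this toy example, but the action on which the members disagree receives the lower bound, so a policy that maximizes the LCB would prefer the in-distribution action. More diverse members widen the spread on OOD actions, which is one way to see why diversity could let a smaller ensemble produce a comparable penalty.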