Alleviating overestimation bias is critical if deep reinforcement learning is to perform well on more complex tasks or on offline datasets containing out-of-distribution data. To overcome overestimation bias, ensemble methods for Q-learning have been investigated as a way to exploit the diversity of multiple Q-functions. Network initialization has been the predominant approach to promoting diversity among Q-functions, and heuristically designed diversity injection methods have also been studied in the literature. However, previous studies have not approached the problem of guaranteeing independence across an ensemble from a theoretical perspective. By introducing a novel regularization loss for Q-ensemble independence based on random matrix theory, we propose spiked Wishart Q-ensemble independence regularization (SPQR) for reinforcement learning. Specifically, we replace the intractable hypothesis testing criterion for Q-ensemble independence with a tractable KL divergence between the spectral distribution of the Q-ensemble and the target Wigner semicircle distribution. We implement SPQR in several online and offline ensemble Q-learning algorithms. In our experiments, SPQR outperforms the baseline algorithms on both online and offline RL benchmarks.
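To make the spectral criterion concrete, the following is a minimal NumPy sketch of a KL divergence between an empirical eigenvalue distribution and the Wigner semicircle law. It is an illustration under stated assumptions, not the SPQR loss itself: the helper names (`semicircle_pdf`, `symmetrize`, `spectral_kl_to_semicircle`), the way values are packed into a symmetric matrix, the standardization, and the histogram-based density estimate are all hypothetical choices made here. A histogram is not differentiable, so a training regularizer would need a differentiable surrogate.

```python
import numpy as np


def semicircle_pdf(x, radius=2.0):
    """Wigner semicircle density supported on [-radius, radius]."""
    x = np.asarray(x, dtype=float)
    inside = np.clip(radius**2 - x**2, 0.0, None)
    return 2.0 / (np.pi * radius**2) * np.sqrt(inside)


def symmetrize(values, n):
    """Hypothetical helper: pack standardized values into an n x n symmetric matrix.

    How Q-values are arranged into a matrix is an assumption of this sketch.
    """
    needed = n * (n + 1) // 2
    v = np.asarray(values, dtype=float).ravel()
    if v.size < needed:
        raise ValueError(f"need at least {needed} values, got {v.size}")
    v = v[:needed]
    v = (v - v.mean()) / (v.std() + 1e-8)  # zero mean, unit variance
    upper = np.zeros((n, n))
    upper[np.triu_indices(n)] = v
    # Mirror the strictly upper triangle into the lower triangle.
    return upper + np.triu(upper, k=1).T


def spectral_kl_to_semicircle(sym, bins=20, eps=1e-8):
    """Histogram estimate of KL(empirical spectrum || semicircle law).

    Eigenvalues are scaled by 1/sqrt(n) so that a symmetric matrix with
    independent unit-variance entries has a spectrum close to the
    semicircle on [-2, 2]. Outliers are clipped into the edge bins.
    """
    n = sym.shape[0]
    eigvals = np.linalg.eigvalsh(sym) / np.sqrt(n)
    edges = np.linspace(-2.0, 2.0, bins + 1)
    counts, _ = np.histogram(np.clip(eigvals, -2.0, 2.0), bins=edges)
    p = counts / counts.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    q = semicircle_pdf(centers) * np.diff(edges)
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 200
    # Stand-in for independent ensemble outputs: i.i.d. Gaussian values.
    independent = symmetrize(rng.standard_normal(n * (n + 1) // 2), n)
    # Stand-in for strongly dependent outputs: a rank-one matrix.
    a = rng.standard_normal(n)
    dependent = np.outer(a, a)
    print("KL (independent):", spectral_kl_to_semicircle(independent))
    print("KL (dependent):  ", spectral_kl_to_semicircle(dependent))
```

In this sketch the independent case should give a KL value near zero, while the rank-one case concentrates its spectrum around zero and gives a much larger value. That gap is what an independence regularizer of this kind would penalize.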