Value-based reinforcement-learning algorithms have shown strong results in games, robotics, and other real-world applications. Overestimation bias is a known threat to these algorithms and can lead to severe performance degradation or even complete algorithmic failure. We frame the bias problem statistically, as an instance of estimating the maximum expected value (MEV) of a set of random variables. We propose the $T$-Estimator (TE), based on two-sample testing for the mean, which flexibly interpolates between over- and underestimation by adjusting the significance level of the underlying hypothesis tests. A generalization, termed the $K$-Estimator (KE), obeys the same bias and variance bounds as the TE while relying on a nearly arbitrary kernel function. We introduce modifications of $Q$-Learning and the Bootstrapped Deep $Q$-Network (BDQN) that use the TE and the KE, and prove convergence in the tabular setting. Furthermore, we propose an adaptive variant of the TE-based BDQN that dynamically adjusts the significance level to minimize the absolute estimation bias. All proposed estimators and algorithms are tested and validated on diverse tasks and environments, illustrating the bias control and performance potential of the TE and KE.
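To make the interpolation idea concrete, the following is a minimal sketch of one way a two-sample-test-based MEV estimator could look. It is a reading of the abstract, not the authors' definition: the function names `max_estimator` and `t_estimator`, the one-sided test against the variable with the largest sample mean, the Welch-type standard error, and the use of a normal quantile as the critical value are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def max_estimator(samples):
    """Classical maximum estimator: the largest sample mean (tends to overestimate)."""
    return max(fmean(s) for s in samples)


def t_estimator(samples, alpha=0.1):
    """Hypothetical sketch of a test-based MEV estimator.

    Averages the sample means of all variables whose mean is NOT significantly
    smaller than the largest sample mean at significance level `alpha`.
    Assumption: one-sided two-sample test with a normal critical value.
    """
    if not 0.0 < alpha <= 0.5:
        raise ValueError("alpha must lie in (0, 0.5]")

    means = [fmean(s) for s in samples]
    # Squared standard error of each sample mean (needs >= 2 observations each).
    se2 = [variance(s) / len(s) for s in samples]
    best = max(range(len(samples)), key=means.__getitem__)

    # Critical value: 0 for alpha = 0.5, increasingly negative as alpha -> 0.
    z_alpha = NormalDist().inv_cdf(alpha)

    selected = []
    for i, (m, v) in enumerate(zip(means, se2)):
        if i == best:
            selected.append(m)
            continue
        denom = sqrt(v + se2[best])
        if denom > 0:
            stat = (m - means[best]) / denom
        else:
            # Degenerate case: no sampling noise, so only exact ties are kept.
            stat = 0.0 if m == means[best] else float("-inf")
        if stat >= z_alpha:
            selected.append(m)
    return fmean(selected)


# Three variables with sample means 2, 3, and 4 and equal sample variance.
samples = [[0.0, 2.0, 4.0], [1.0, 3.0, 5.0], [2.0, 4.0, 6.0]]
print(t_estimator(samples, alpha=0.5))   # 4.0, same as max_estimator
print(t_estimator(samples, alpha=0.2))   # 3.5, averages the two largest means
print(t_estimator(samples, alpha=0.05))  # 3.0, averages all three means
```

In this sketch, `alpha = 0.5` recovers the maximum estimator, while a very small `alpha` keeps every variable and returns the plain average of the sample means, which cannot exceed the largest one. Intermediate significance levels move between those two extremes, which is the kind of bias control the abstract describes.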