The ensemble method, in which multiple function approximators are used to estimate the action values, is a promising way to mitigate the overestimation issue in Q-learning. It is known that the estimation bias hinges heavily on the ensemble size (i.e., the number of Q-function approximators used in the target), and that determining the "right" ensemble size is highly nontrivial because the function approximation errors vary over time during the learning process. To tackle this challenge, we first derive an upper bound and a lower bound on the estimation bias, based on which the ensemble size is adapted to drive the bias to nearly zero, thereby coping with the impact of the time-varying approximation errors. Motivated by these theoretical findings, we advocate combining the ensemble method with Model Identification Adaptive Control (MIAC) for effective ensemble size adaptation. Specifically, we devise Adaptive Ensemble Q-learning (AdaEQ), a generalized ensemble method with two key steps: (a) approximation error characterization, which serves as the feedback for flexibly controlling the ensemble size, and (b) ensemble size adaptation tailored towards minimizing the estimation bias. Extensive experiments show that AdaEQ improves learning performance over existing methods on the MuJoCo benchmark.
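As a rough illustration of the two steps (a) and (b), the following Python sketch shows one way an error signal could feed back into the ensemble size. It is a minimal sketch under stated assumptions, not the AdaEQ implementation: the function names `ensemble_target` and `adapt_ensemble_size`, the symmetric `tolerance` band, the unit step size, and the use of a minimum over a random subset of estimates in the target are all hypothetical choices made for illustration. The sketch relies only on the general fact that taking the minimum over more estimates lowers the target, so a larger ensemble size counteracts overestimation and a smaller one counteracts underestimation.

```python
import random


def ensemble_target(reward, gamma, next_q_values, m, rng=random):
    """Bootstrapped target from the minimum over a random subset of m estimates.

    next_q_values is a list with one next-state action value per ensemble
    member (a hypothetical input format chosen for this sketch).
    """
    subset = rng.sample(next_q_values, m)
    return reward + gamma * min(subset)


def adapt_ensemble_size(m, error, n_total, tolerance=0.0, m_min=1):
    """Move the ensemble size m one step in the direction that shrinks the bias.

    error is an estimated signed approximation error (Q estimate minus a
    reference return); positive values indicate overestimation. The
    tolerance band and the unit step are assumptions of this sketch.
    """
    if error > tolerance:
        m += 1  # minimum over more estimates pulls the target down
    elif error < -tolerance:
        m -= 1  # minimum over fewer estimates lets the target rise
    return max(m_min, min(n_total, m))  # keep m within [m_min, n_total]
```

In a training loop, one would periodically estimate `error`, call `adapt_ensemble_size`, and pass the updated `m` to `ensemble_target` when forming the targets for the next updates.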