Ensembles are ubiquitous in off-policy actor-critic learning, yet their efficacy depends critically on how they are aggregated. Current methods typically rely on static rules or task-specific hyperparameters to balance overestimation bias against variance, leaving truly adaptive aggregation an open problem. We introduce Adaptive Ensemble Aggregation (AEA), an algorithm that dynamically constructs ensemble-based targets for both critic and actor updates directly from training dynamics. We prove that AEA converges to a unique equilibrium at which the aggregation parameter minimizes value estimation error within a defined stability region. We further establish a shrinkage property: the estimation bias vanishes as the total ensemble size grows. Subset-based methods such as REDQ hit an information bottleneck determined by a fixed variance floor, regardless of ensemble size. In contrast, AEA exploits the full ensemble to achieve optimal variance reduction, scaling inversely with the total number of models, as well as maximal Fisher information. Furthermore, we provide a formal guarantee of monotonic policy improvement under this adaptive regime. Extensive evaluations on a range of continuous control tasks demonstrate that AEA outperforms state-of-the-art baselines on the majority of tasks, providing a robust and self-calibrating framework for ensemble-based reinforcement learning.
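The contrast between a fixed variance floor and variance that scales inversely with ensemble size can be illustrated with a toy simulation. The sketch below is not AEA itself, whose aggregation rule is not specified in this abstract. It assumes each critic's error is independent Gaussian noise around a true value of zero, and compares a plain full-ensemble mean with a REDQ-style minimum over a subset of two critics. The function name `simulate_targets` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def simulate_targets(n_models, subset_size=2, sigma=1.0, n_trials=20000, seed=0):
    """Return (variance of full-ensemble mean, variance of subset minimum).

    Toy assumption: the true value is 0 and each critic's estimate is
    corrupted by independent Gaussian noise with standard deviation sigma.
    """
    rng = np.random.default_rng(seed)
    # One row per trial, one column per critic in the ensemble.
    q = rng.normal(0.0, sigma, size=(n_trials, n_models))

    # Aggregating all critics: the variance of the mean is sigma^2 / n_models.
    full_mean = q.mean(axis=1)

    # REDQ-style target: minimum over a small subset. Columns are i.i.d.,
    # so the first `subset_size` columns stand in for a random subset.
    subset_min = q[:, :subset_size].min(axis=1)

    return full_mean.var(), subset_min.var()


if __name__ == "__main__":
    for n in (2, 8, 32):
        v_mean, v_min = simulate_targets(n)
        print(f"N={n:2d}  var(full mean)={v_mean:.4f}  var(subset min)={v_min:.4f}")
```

Under these assumptions, the variance of the full-ensemble mean shrinks roughly as sigma^2 / N, while the variance of the two-critic minimum stays near sigma^2 (1 - 1/pi) however large N becomes. Real critics are correlated and biased, so this only conveys the qualitative scaling argument, not the paper's formal result.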