Hyperparameter tuning in reinforcement learning (RL) is critical, as these parameters strongly influence an agent's performance and learning efficiency. Dynamically adjusting hyperparameters during training can substantially improve both performance and learning stability. Population-based training (PBT) achieves this by continuously tuning hyperparameters throughout training; the ongoing adjustment lets models adapt to different learning stages, yielding faster convergence and better overall performance. In this paper, we propose an enhancement to PBT that uses both first- and second-order optimizers simultaneously within a single population. We conducted a series of experiments with the TD3 algorithm across several MuJoCo environments. Our results empirically demonstrate, for the first time, the potential of incorporating second-order optimizers into PBT-based RL. Specifically, combining the K-FAC optimizer with Adam improved overall performance by up to 10% compared to PBT using Adam alone. Moreover, in environments where Adam occasionally fails, such as Swimmer, the mixed population with K-FAC learned more reliably, offering a significant advantage in training stability without a substantial increase in computation time.
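To make the mixed-population idea concrete, the following is a minimal sketch of one PBT exploit/explore step over a population whose members carry an optimizer tag (Adam or K-FAC) alongside their hyperparameters. This is an illustrative simplification, not the paper's implementation: the worker structure, quartile-based selection, and perturbation factors are assumptions, and real training state (network weights, optimizer moments) is omitted.

```python
import random

def pbt_step(population, perturb=0.2, rng=random):
    """One exploit/explore step of population-based training (PBT).

    Each worker is a dict with a 'score', a learning rate 'lr', and an
    'optimizer' tag ('adam' or 'kfac'). The bottom quartile copies the
    hyperparameters and optimizer choice of a top-quartile worker
    (exploit), then perturbs the learning rate (explore). Because the
    optimizer tag is inherited, first- and second-order optimizers
    compete within the same population.
    """
    ranked = sorted(population, key=lambda w: w["score"], reverse=True)
    k = max(1, len(ranked) // 4)
    top, bottom = ranked[:k], ranked[-k:]
    for worker in bottom:
        parent = rng.choice(top)
        worker["lr"] = parent["lr"] * rng.choice([1 - perturb, 1 + perturb])
        worker["optimizer"] = parent["optimizer"]  # mixed pool: Adam or K-FAC
    return population

# Toy population mixing first- and second-order optimizer tags.
pop = [
    {"score": 10.0, "lr": 1e-3, "optimizer": "adam"},
    {"score": 50.0, "lr": 3e-4, "optimizer": "kfac"},
    {"score": 40.0, "lr": 1e-4, "optimizer": "adam"},
    {"score": 5.0,  "lr": 1e-2, "optimizer": "adam"},
]
pbt_step(pop, rng=random.Random(0))
```

In this toy run the worst worker inherits the best worker's K-FAC tag and a perturbed copy of its learning rate, which is how an initially Adam-heavy population can shift toward K-FAC in environments where Adam struggles.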