Robust adversarial reinforcement learning has emerged as an effective paradigm for training agents to handle uncertain disturbances in real environments, with critical applications in sequential decision-making domains such as autonomous driving and robotic control. Within this paradigm, agent training is typically formulated as a zero-sum Markov game between a protagonist and an adversary to enhance policy robustness. However, the trainable nature of the adversary inevitably induces non-stationarity in the learning dynamics, exacerbating training instability and convergence difficulties, particularly in high-dimensional complex environments. In this paper, we propose a novel approach, Uncertainty-Adaptive Critic Ensemble for robust adversarial Reinforcement learning (UACER), which consists of two components: 1) Diversified critic ensemble: a diverse set of K critic networks is employed in parallel to stabilize Q-value estimation in robust adversarial reinforcement learning, reducing variance and enhancing robustness compared to conventional single-critic designs. 2) Time-varying Decay Uncertainty (TDU) mechanism: moving beyond simple linear combinations, we propose a variance-derived Q-value aggregation strategy that explicitly incorporates epistemic uncertainty to adaptively regulate the exploration-exploitation trade-off while stabilizing the training process. Comprehensive experiments across several challenging MuJoCo control problems validate the effectiveness of UACER, which outperforms state-of-the-art methods in overall performance, stability, and efficiency.
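To make the TDU idea concrete, the following is a minimal sketch of a variance-derived aggregation over K critic estimates. It assumes one plausible functional form: the ensemble mean is adjusted by the ensemble standard deviation (a proxy for epistemic uncertainty) scaled by a coefficient that decays over training steps. The function name `tdu_aggregate`, the exponential decay schedule, and the sign of the uncertainty term are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def tdu_aggregate(q_values, step, c0=1.0, decay_rate=1e-4):
    """Aggregate K critic Q-value estimates with a time-decaying
    uncertainty term (illustrative sketch of a TDU-style rule).

    q_values   : iterable of K per-critic Q estimates for one (s, a)
    step       : current training step (drives the decay)
    c0         : initial uncertainty coefficient (assumed hyperparameter)
    decay_rate : decay speed of the coefficient (assumed hyperparameter)
    """
    q = np.asarray(q_values, dtype=float)
    c_t = c0 * np.exp(-decay_rate * step)   # time-varying decay coefficient
    # Penalize high ensemble disagreement early (conservative targets),
    # fading toward the plain ensemble mean as training progresses.
    return q.mean() - c_t * q.std()
```

Early in training the disagreement penalty is large, which keeps value targets conservative under a non-stationary adversary; as the coefficient decays, the aggregate approaches the ensemble mean, shifting the balance toward exploitation.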