Quality-Diversity (QD) algorithms excel at discovering diverse repertoires of skills, but they are hindered by poor sample efficiency and often require tens of millions of environment steps to solve complex locomotion tasks. Recent advances in Reinforcement Learning (RL) have shown that high Update-to-Data (UTD) ratios accelerate Actor-Critic learning. However, standard high-UTD algorithms typically rely on target networks to stabilise training. This requirement introduces a significant computational bottleneck, rendering these algorithms impractical for resource-intensive QD tasks where sample efficiency and rapid population adaptation are critical. In this paper, we introduce QDHUAC, a sample-efficient, target-free, distributional QD-RL algorithm that provides dense, low-variance gradient signals and thereby enables high-UTD training for Dominated Novelty Search. We demonstrate that our method trains stably at high UTD ratios, achieving competitive coverage and fitness on high-dimensional Brax environments with an order of magnitude fewer environment steps than baselines. Our results suggest that combining target-free distributional critics with dominance-based selection is a key enabler for the next generation of sample-efficient evolutionary RL algorithms.