Discrete-action reinforcement learning algorithms often falter in tasks with high-dimensional discrete action spaces, because the number of possible joint actions grows exponentially with the number of action dimensions. A recent advancement leverages value-decomposition, a concept from multi-agent reinforcement learning, to tackle this challenge. This study analyses the effects of this value-decomposition and shows that, whilst it curtails the over-estimation bias inherent to Q-learning algorithms, it amplifies target variance. To counteract this, we propose an ensemble of critics. Moreover, we introduce a regularisation loss that mitigates the effect that exploratory actions in one dimension can have on the value of optimal actions in other dimensions. Our novel algorithm, REValueD, tested on discretised versions of the DeepMind Control Suite tasks, achieves superior performance, especially in the challenging humanoid and dog tasks. We further examine the factors influencing REValueD's performance, evaluating the significance of the regularisation loss and how REValueD scales with an increasing number of sub-actions per dimension.
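As an illustration of the two ideas named above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the common additive form of value-decomposition, in which the joint value is the mean of per-dimension utilities, and it assumes that the ensemble reduces target variance by averaging the critics' utilities before the per-dimension maximum is taken. The function names (`decomposed_q`, `greedy_action`, `ensemble_target`) and array shapes are hypothetical, and the regularisation loss is not shown.

```python
import numpy as np


def decomposed_q(utilities, action):
    """Joint value as the mean of per-dimension utilities (assumed form).

    utilities: array of shape (N, n), one row of sub-action values per dimension.
    action:    integer array of length N with the chosen sub-action per dimension.
    """
    dims = np.arange(utilities.shape[0])
    return utilities[dims, action].mean()


def greedy_action(utilities):
    """Under an additive decomposition the joint argmax factorises,
    so each dimension can be maximised independently."""
    return utilities.argmax(axis=1)


def ensemble_target(reward, gamma, next_utilities, done=False):
    """One-step bootstrap target from an ensemble of K critics (assumed form).

    next_utilities: array of shape (K, N, n) holding each critic's
    utilities for the next state.
    """
    mean_utilities = next_utilities.mean(axis=0)    # average over critics: (N, n)
    bootstrap = mean_utilities.max(axis=1).mean()   # per-dimension max, then mean
    return reward + gamma * (1.0 - float(done)) * bootstrap


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, N, n = 5, 3, 4                    # critics, action dimensions, sub-actions
    utilities = rng.normal(size=(N, n))  # N * n outputs instead of n ** N joint values
    action = greedy_action(utilities)
    print("greedy joint action:", action)
    print("decomposed Q:", decomposed_q(utilities, action))
    print("target:", ensemble_target(1.0, 0.99, rng.normal(size=(K, N, n))))
```

With N dimensions of n sub-actions each, the sketch evaluates N × n utilities rather than n^N joint action values, which is what makes the decomposition tractable in high-dimensional action spaces.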