Discrete-action reinforcement learning algorithms often falter in tasks with high-dimensional discrete action spaces, because the number of possible joint actions grows combinatorially with the number of action dimensions. A recent advance tackles this challenge with value-decomposition, a concept borrowed from multi-agent reinforcement learning. We analyse the effects of this value-decomposition and show that, whilst it curtails the over-estimation bias inherent to Q-learning algorithms, it amplifies target variance. To counteract this, we use an ensemble of critics. We also introduce a regularisation loss that mitigates the effect that exploratory actions in one dimension can have on the value of optimal actions in other dimensions. Our algorithm, REValueD, is evaluated on discretised versions of the DeepMind Control Suite tasks and outperforms the compared methods, especially in the challenging humanoid and dog tasks. We further examine the factors influencing REValueD's performance, assessing the contribution of the regularisation loss and how REValueD scales as the number of sub-actions per dimension increases.
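To make the idea concrete, the following is a minimal sketch of value-decomposition with an ensemble of critics, not the authors' implementation. It assumes that the Q-value of a joint action is the mean of per-dimension utilities, and that the ensemble is combined by averaging utilities before the greedy maximisation; the abstract states neither detail. All function and variable names are hypothetical, and the regularisation loss is omitted because the abstract does not specify its form.

```python
from statistics import fmean

def greedy_action(utilities):
    """Pick the best sub-action independently in each action dimension.

    utilities[i][j] is the utility of sub-action j in dimension i, so the
    search costs N * n evaluations instead of n ** N joint actions.
    """
    return [max(range(len(u)), key=u.__getitem__) for u in utilities]

def decomposed_q(utilities, action):
    """Q-value of a joint action (assumed: mean of per-dimension utilities)."""
    return fmean(u[a] for u, a in zip(utilities, action))

def ensemble_target(next_utilities_per_critic, reward, gamma=0.99):
    """Bootstrap target built from utilities averaged over several critics.

    next_utilities_per_critic[k][i][j] is critic k's utility for sub-action j
    in dimension i at the next state. Averaging across critics before taking
    the per-dimension maximum is an assumption made for illustration.
    """
    mean_utils = [
        [fmean(per_critic) for per_critic in zip(*rows)]
        for rows in zip(*next_utilities_per_critic)
    ]
    return reward + gamma * fmean(max(u) for u in mean_utils)

# Toy data: 2 action dimensions, 3 sub-actions each, 2 critics.
critic_a = [[1.0, 2.0, 0.0], [0.5, 0.0, 1.5]]
critic_b = [[3.0, 0.0, 0.0], [0.5, 1.0, 0.5]]

best = greedy_action(critic_a)
q_best = decomposed_q(critic_a, best)
target = ensemble_target([critic_a, critic_b], reward=1.0, gamma=0.5)
```

In this toy setting the greedy search inspects 2 × 3 utilities rather than 3² joint actions. With many dimensions that gap becomes the difference between a tractable and an intractable maximisation.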