Conventional wisdom suggests that policy gradient methods are better suited to complex action spaces than action-value methods. However, foundational studies have shown equivalences between these paradigms in small, finite action spaces (O'Donoghue et al., 2017; Schulman et al., 2017a). This raises the question of why their computational applicability and performance diverge as the complexity of the action space increases. We hypothesize that the apparent superiority of policy gradients in such settings stems not from intrinsic qualities of the paradigm, but from universal principles that can also be applied to action-value methods to similar effect. We identify three such principles and provide a framework for incorporating them into action-value methods. To support our hypothesis, we instantiate this framework in what we term QMLE, for Q-learning with maximum likelihood estimation. Our results show that QMLE can be applied to complex action spaces at a controllable computational cost comparable to that of policy gradient methods, all without using policy gradients. Furthermore, QMLE demonstrates strong performance on the DeepMind Control Suite, even when compared with state-of-the-art methods such as DMPO and D4PG.
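To make the premise concrete, the sketch below shows one generic, policy-gradient-free way that Q-learning can operate in a continuous action space: maintain a parametric proposal distribution over actions and refit it by maximum likelihood to the highest-Q sampled actions (a cross-entropy-method-style update). This is an illustrative assumption, not the paper's QMLE algorithm; all function names and parameters here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_value(state, actions):
    # Stand-in critic for illustration: a fixed quadratic whose
    # optimal action equals the state vector itself.
    return -np.sum((actions - state) ** 2, axis=-1)

def mle_action_selection(state, action_dim=2, n_samples=64, n_elite=8, n_iters=5):
    """Select an action by iteratively refitting a Gaussian proposal.

    Each iteration samples candidate actions, scores them with the
    critic, and refits the proposal by maximum likelihood to the
    top-scoring ("elite") actions. No policy gradient is computed.
    """
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    for _ in range(n_iters):
        actions = mean + std * rng.standard_normal((n_samples, action_dim))
        elite = actions[np.argsort(q_value(state, actions))[-n_elite:]]
        # Closed-form maximum-likelihood Gaussian fit to the elite set.
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

state = np.array([0.5, -0.3])
action = mle_action_selection(state)
```

Note that the computational cost of such a scheme is set directly by `n_samples` and `n_iters`, which is one way an action-value method can expose the kind of controllable cost the abstract attributes to QMLE.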