Existing off-policy reinforcement learning algorithms typically require an explicit state-action-value function representation, which becomes problematic in high-dimensional action spaces. These algorithms suffer from the curse of dimensionality, because maintaining a state-action-value function in such spaces is data-inefficient. In this work, we propose Vlearn, a novel off-policy trust region optimization approach that eliminates the requirement for an explicit state-action-value function. Instead, we demonstrate how to efficiently leverage just a state-value function as the critic, thus overcoming several limitations of existing methods. By doing so, Vlearn addresses the computational challenges posed by high-dimensional action spaces. Furthermore, Vlearn introduces an efficient approach to address the challenges associated with pure state-value function learning in the off-policy setting. This approach not only simplifies the implementation of off-policy policy gradient algorithms but also leads to consistent and robust performance across various benchmark tasks. Specifically, by removing the need for a state-action-value function, Vlearn simplifies the learning process and allows for more efficient exploration and exploitation in complex environments.