Stop Regressing: Training Value Functions via Classification for Scalable Deep RL

Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taïga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, Aviral Kumar, Rishabh Agarwal

Value functions are a central component of deep reinforcement learning (RL). These functions, parameterized by neural networks, are trained using a mean squared error regression objective to match bootstrapped target values. However, scaling value-based RL methods that use regression to large networks, such as high-capacity Transformers, has proven challenging. This difficulty is in stark contrast to supervised learning: by leveraging a cross-entropy classification loss, supervised methods have scaled reliably to massive networks. Motivated by this discrepancy, we investigate whether the scalability of deep RL can also be improved simply by using classification in place of regression for training value functions. We demonstrate that training value functions with categorical cross-entropy significantly improves performance and scalability in a variety of domains: single-task RL on Atari 2600 games with SoftMoEs, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers. The approach achieves state-of-the-art results in these domains. Through careful analysis, we show that the benefits of categorical cross-entropy primarily stem from its ability to mitigate issues inherent to value-based RL, such as noisy targets and non-stationarity. Overall, we argue that a simple shift to training value functions with categorical cross-entropy can yield substantial improvements in the scalability of deep RL at little-to-no cost.
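To make "classification in place of regression" concrete, here is a minimal sketch of how a scalar value target can be turned into a categorical target and trained with cross-entropy. The abstract does not say how the authors build the target distribution, so the two-hot encoding below is an assumption chosen for simplicity, and the names `two_hot`, `cross_entropy`, `expected_value`, `v_min`, `v_max`, and `num_bins` are illustrative rather than taken from the paper.

```python
import numpy as np

def two_hot(target, v_min, v_max, num_bins):
    """Spread a scalar target over the two nearest of num_bins evenly spaced bin centers."""
    centers = np.linspace(v_min, v_max, num_bins)
    target = float(np.clip(target, v_min, v_max))
    # Fractional position of the target on the bin grid.
    pos = (target - v_min) / (v_max - v_min) * (num_bins - 1)
    lower = int(np.floor(pos))
    upper = min(lower + 1, num_bins - 1)
    probs = np.zeros(num_bins)
    probs[lower] += 1.0 - (pos - lower)  # weight on the bin below the target
    probs[upper] += pos - lower          # weight on the bin above the target
    return probs, centers

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def cross_entropy(logits, target_probs):
    """Categorical cross-entropy between predicted logits and a target distribution."""
    return float(-np.sum(target_probs * log_softmax(logits)))

def expected_value(logits, centers):
    """Recover a scalar value estimate as the mean of the predicted distribution."""
    return float(np.sum(np.exp(log_softmax(logits)) * centers))

# Hypothetical example: a bootstrapped target of 0.3 on a 5-bin grid over [0, 1].
probs, centers = two_hot(0.3, v_min=0.0, v_max=1.0, num_bins=5)
logits = np.zeros(5)  # an untrained network head that predicts a uniform distribution
loss = cross_entropy(logits, probs)
```

In this sketch the network outputs `num_bins` logits per value instead of a single scalar, the loss replaces mean squared error, and the scalar value needed for acting or bootstrapping is read back with `expected_value`. Two-hot encoding preserves the target's mean exactly, which is why it is a convenient stand-in here.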
