Sample efficiency is a crucial problem in deep reinforcement learning. Recent algorithms such as REDQ and DroQ improve sample efficiency by raising the update-to-data (UTD) ratio to 20 critic gradient updates per environment sample. However, this comes at the expense of a greatly increased computational cost. To reduce this computational burden, we introduce CrossQ, a lightweight algorithm for continuous control tasks that makes careful use of Batch Normalization and removes target networks to surpass the current state of the art in sample efficiency while maintaining a low UTD ratio of 1. Notably, CrossQ does not rely on the advanced bias-reduction schemes used in current methods. CrossQ's contributions are threefold: (1) it matches or surpasses current state-of-the-art methods in terms of sample efficiency, (2) it substantially reduces the computational cost compared to REDQ and DroQ, and (3) it is easy to implement, requiring just a few lines of code on top of SAC.