The offline reinforcement learning (RL) paradigm provides a general recipe for converting static behavior datasets into policies that can outperform the policy that collected the data. While policy constraints, conservatism, and other methods for mitigating distributional shift have made offline RL more effective, the continuous action setting often requires various approximations to apply these techniques. Many of these challenges are greatly alleviated in discrete action settings, where offline RL constraints and regularizers can often be computed more precisely or even exactly. In this paper, we propose an adaptive scheme for action quantization. We use a VQ-VAE to learn a state-conditioned action quantization, avoiding the exponential blowup that comes with naive discretization of the action space. We show that several state-of-the-art offline RL methods, such as IQL, CQL, and BRAC, perform better on benchmarks when combined with our proposed discretization scheme. We further validate our approach on a set of challenging long-horizon robotic manipulation tasks in the Robomimic environment, where our discretized offline RL algorithms improve upon their continuous counterparts by 2-3x. Our project page is at https://saqrl.github.io/
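To make the contrast between naive discretization and a state-conditioned codebook concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hand-written codebook and a toy encoder/decoder pair (`toy_encode`, `toy_decode`, both hypothetical) standing in for the learned VQ-VAE networks; only the nearest-codebook lookup reflects the standard vector-quantization step.

```python
def naive_grid_size(bins_per_dim, action_dim):
    """Number of discrete actions when each dimension is binned independently."""
    return bins_per_dim ** action_dim


def nearest_code(latent, codebook):
    """Vector-quantization step: index of the codebook entry closest to `latent`."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda k: sq_dist(latent, codebook[k]))


# ASSUMPTION: a fixed toy codebook. In a VQ-VAE these vectors are learned.
CODEBOOK = [[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]


def toy_encode(state, action, codebook):
    """Hypothetical stand-in for a learned encoder q(z | s, a).

    The latent is the action's offset from the state, so the resulting
    discrete code depends on both the state and the action.
    """
    latent = [a - s for s, a in zip(state, action)]
    return nearest_code(latent, codebook)


def toy_decode(state, code, codebook):
    """Hypothetical stand-in for a learned decoder p(a | s, z)."""
    return [s + offset for s, offset in zip(state, codebook[code])]


# Naive binning grows exponentially with the action dimension.
print(naive_grid_size(10, 7))

# The same discrete code decodes to different continuous actions in different states.
code = toy_encode([0.5, 0.5], [1.4, 0.6], CODEBOOK)
print(code, toy_decode([0.5, 0.5], code, CODEBOOK), toy_decode([0.0, 0.0], code, CODEBOOK))
```

In this sketch, a discrete offline RL method would only ever see `len(CODEBOOK)` actions per state, regardless of the action dimension, whereas binning each dimension independently yields `bins_per_dim ** action_dim` actions.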