Visual reinforcement learning is appealing for robotics but expensive -- off-policy methods are sample-efficient yet slow, while on-policy methods parallelize well but waste samples. Recent work has shown that, for state-based control, off-policy methods can train faster than on-policy methods in wall-clock time. Extending this result to vision remains challenging: high-dimensional image inputs complicate training dynamics and introduce substantial storage and encoding overhead. To address these challenges, we introduce Squint, a visual Soft Actor-Critic method that achieves faster wall-clock training than prior visual off-policy and on-policy methods. Squint achieves this via parallel simulation, a distributional critic, resolution squinting, layer normalization, a tuned update-to-data ratio, and an optimized implementation. We evaluate on the SO-101 Task Set, a new suite of eight manipulation tasks in ManiSkill3 with heavy domain randomization, and demonstrate sim-to-real transfer to a real SO-101 robot. We train policies for 15 minutes on a single RTX 3090 GPU, with most tasks converging in under 6 minutes.
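To make the listed ingredients concrete, below is a minimal sketch (not the authors' code) of three of them as they might plausibly appear in a visual SAC pipeline: resolution squinting (downsampling observations before encoding to cut storage and encoding cost), layer normalization in both the encoder and critic, and a categorical distributional critic. All module names, architectures, and hyperparameters here are illustrative assumptions, not Squint's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SquintedEncoder(nn.Module):
    """CNN encoder that first downsamples ("squints") the input image."""
    def __init__(self, squint_res=64, feat_dim=128):
        super().__init__()
        self.squint_res = squint_res
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
        )
        # Infer the flattened conv output size for the projection head.
        with torch.no_grad():
            n = self.conv(torch.zeros(1, 3, squint_res, squint_res)).flatten(1).shape[1]
        self.proj = nn.Sequential(nn.Linear(n, feat_dim), nn.LayerNorm(feat_dim), nn.Tanh())

    def forward(self, img):
        # Resolution squinting: downsample before encoding, so the replay
        # buffer stores small images and the CNN processes fewer pixels.
        img = F.interpolate(img, size=(self.squint_res, self.squint_res),
                            mode="bilinear", align_corners=False)
        return self.proj(self.conv(img).flatten(1))

class DistributionalCritic(nn.Module):
    """Categorical (C51-style) critic head with LayerNorm for stability."""
    def __init__(self, feat_dim=128, act_dim=6, n_atoms=51, v_min=-10.0, v_max=10.0):
        super().__init__()
        self.register_buffer("atoms", torch.linspace(v_min, v_max, n_atoms))
        self.net = nn.Sequential(
            nn.Linear(feat_dim + act_dim, 256), nn.LayerNorm(256), nn.ReLU(),
            nn.Linear(256, 256), nn.LayerNorm(256), nn.ReLU(),
            nn.Linear(256, n_atoms),
        )

    def forward(self, feat, act):
        logits = self.net(torch.cat([feat, act], dim=-1))
        probs = logits.softmax(dim=-1)
        q = (probs * self.atoms).sum(dim=-1)  # expected Q-value under the categorical distribution
        return q, probs
```

The remaining ingredients are orthogonal to this sketch: parallel simulation means collecting transitions from many ManiSkill3 environments per step, and the update-to-data ratio controls how many gradient updates are performed per batch of collected transitions, trading sample efficiency against wall-clock speed.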