Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation

Learning visuomotor policies for multi-task robotic manipulation has been a long-standing challenge for the robotics community. The difficulty lies in the diversity of the action space: a goal can typically be accomplished in multiple ways, so even a single task has a multimodal action distribution, and this complexity grows as the number of tasks increases. In this work, we propose \textbf{Discrete Policy}, a robot learning method for training universal agents capable of multi-task manipulation skills. Discrete Policy uses vector quantization to map action sequences into a discrete latent space, which facilitates the learning of task-specific codes. These codes are then decoded back into the action space, conditioned on observations and the language instruction. We evaluate our method in simulation and on multiple real-world embodiments, covering both single-arm and bimanual robot settings. Discrete Policy outperforms a well-established Diffusion Policy baseline and several state-of-the-art approaches, including ACT, Octo, and OpenVLA. For example, in a real-world multi-task training setting with five tasks, Discrete Policy achieves an average success rate that is 26\% higher than Diffusion Policy and 15\% higher than OpenVLA. When the number of tasks increases to 12, the performance gap between Discrete Policy and Diffusion Policy widens to 32.5\%, further showcasing the advantages of our approach. Our work empirically demonstrates that learning multi-task policies within the latent space is a vital step toward achieving general-purpose agents.
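To make the vector quantization step concrete, the following is a minimal sketch of how an action sequence might be mapped to a discrete code by nearest-neighbor lookup in a codebook. It is an illustration under stated assumptions, not the method's actual implementation: the names `encode_action_chunk`, `quantize`, and `CODEBOOK`, the time-averaging "encoder", and the three-entry 2-D codebook are all hypothetical stand-ins for learned components. The decoder that reconstructs actions from the code, conditioned on observations and the language instruction, is omitted.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode_action_chunk(actions: Sequence[Vector]) -> List[float]:
    """Toy stand-in for a learned encoder (assumption): average actions over time."""
    horizon = len(actions)
    dim = len(actions[0])
    return [sum(step[d] for step in actions) / horizon for d in range(dim)]


def quantize(latent: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry nearest to the latent vector."""
    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(latent, codebook[i]),
    )


# Hypothetical codebook with three codes in a 2-D latent space.
# In a trained model these entries would be learned, not hand-written.
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# A short action sequence (3 time steps, 2 action dimensions).
chunk = [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]]

latent = encode_action_chunk(chunk)   # continuous latent for the chunk
code = quantize(latent, CODEBOOK)     # discrete code index
quantized_latent = CODEBOOK[code]     # vector a decoder would consume
```

Here the whole action chunk is summarized by a single integer index, which is what makes the latent space discrete; distinct ways of accomplishing a goal can then fall on distinct codes rather than being averaged together.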