Language provides a way to break down complex concepts into digestible pieces. Recent works in robot imitation learning use language-conditioned policies that predict actions given visual observations and the high-level task specified in language. These methods leverage the structure of natural language to share data between semantically similar tasks (e.g., "pick coke can" and "pick an apple") in multi-task datasets. However, as tasks become more semantically diverse (e.g., "pick coke can" and "pour cup"), sharing data between tasks becomes harder, so learning to map high-level tasks to actions requires much more demonstration data. To bridge tasks and actions, our insight is to teach the robot the language of actions, describing low-level motions with more fine-grained phrases like "move arm forward". Predicting these language motions as an intermediate step between tasks and actions forces the policy to learn the shared structure of low-level motions across seemingly disparate tasks. Furthermore, a policy conditioned on language motions can easily be corrected during execution through human-specified language motions. This enables a new paradigm for flexible policies that can learn from human intervention in language. Our method RT-H builds an action hierarchy using language motions: it first learns to predict language motions, and conditioned on these and the high-level task, it predicts actions, using visual context at all stages. We show that RT-H leverages this language-action hierarchy to learn policies that are more robust and flexible by effectively tapping into multi-task datasets. We show that these policies not only allow for responding to language interventions, but can also learn from such interventions and outperform methods that learn from teleoperated interventions. Our website and videos can be found at https://rt-hierarchy.github.io.
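To make the action hierarchy concrete, here is a minimal sketch of the control flow it implies: stage one predicts a language motion from the observation and task, and stage two predicts a low-level action conditioned on the observation, task, and that motion, with an optional human-supplied motion override standing in for a language intervention. All names and the toy predictors are hypothetical illustrations, not the RT-H implementation (in RT-H, both stages are queries to a single vision-language model).

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class HierarchicalPolicy:
    """Hypothetical two-stage policy mirroring a language-motion hierarchy.

    predict_motion: (image, task) -> language motion, e.g. "move arm forward"
    predict_action: (image, task, motion) -> low-level action vector
    """
    predict_motion: Callable[[Sequence[float], str], str]
    predict_action: Callable[[Sequence[float], str, str], List[float]]

    def act(self, image, task, motion_override=None):
        # A human can correct the policy mid-execution by supplying a
        # language motion directly, bypassing the stage-one prediction.
        motion = motion_override or self.predict_motion(image, task)
        return motion, self.predict_action(image, task, motion)

# Toy stand-ins for the learned models, just to show the data flow.
def toy_motion(image, task):
    return "move arm forward"

def toy_action(image, task, motion):
    return [0.1, 0.0, 0.0] if motion == "move arm forward" else [0.0, 0.0, 0.0]

policy = HierarchicalPolicy(toy_motion, toy_action)

# Normal rollout: the policy picks the motion itself.
motion, action = policy.act(image=[0.0], task="pick coke can")
print(motion, action)  # move arm forward [0.1, 0.0, 0.0]

# Language intervention: a human overrides the predicted motion.
motion2, action2 = policy.act([0.0], "pick coke can", motion_override="rotate wrist")
print(motion2, action2)  # rotate wrist [0.0, 0.0, 0.0]
```

Because corrections arrive as language motions rather than teleoperated actions, they can be logged as ordinary (observation, task, motion) training pairs for the stage-one predictor.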