Foundation models that incorporate language, vision, and, more recently, actions have revolutionized the ability to harness internet-scale data to reason about useful tasks. However, one of the key challenges in training embodied foundation models is the lack of data grounded in the physical world. In this paper, we propose AutoRT, a system that leverages existing foundation models to scale up the deployment of operational robots in completely unseen scenarios with minimal human supervision. AutoRT uses vision-language models (VLMs) for scene understanding and grounding, and large language models (LLMs) to propose diverse and novel instructions for a fleet of robots to perform. Guiding data collection with the knowledge of foundation models enables AutoRT to reason effectively about autonomy tradeoffs and safety while significantly scaling up data collection for robot learning. We demonstrate AutoRT proposing instructions to over 20 robots across multiple buildings and collecting 77k real robot episodes via both teleoperation and autonomous robot policies. We experimentally show that such "in-the-wild" data collected by AutoRT is significantly more diverse, and that AutoRT's use of LLMs allows for instruction-following data collection robots that can align to human preferences.