Foundation models that incorporate language, vision, and, more recently, actions have revolutionized the ability to harness internet-scale data to reason about useful tasks. However, a key challenge in training embodied foundation models is the lack of data grounded in the physical world. In this paper, we propose AutoRT, a system that leverages existing foundation models to scale up the deployment of operational robots in completely unseen scenarios with minimal human supervision. AutoRT leverages vision-language models (VLMs) for scene understanding and grounding, and further uses large language models (LLMs) to propose diverse and novel instructions to be performed by a fleet of robots. Guiding data collection by tapping into the knowledge of foundation models enables AutoRT to effectively reason about autonomy tradeoffs and safety while significantly scaling up data collection for robot learning. We demonstrate AutoRT proposing instructions to over 20 robots across multiple buildings and collecting 77k real robot episodes via both teleoperation and autonomous robot policies. We experimentally show that such "in-the-wild" data collected by AutoRT is significantly more diverse, and that AutoRT's use of LLMs allows for instruction-following data-collection robots that can be aligned with human preferences.
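To make the described pipeline concrete, the following is a minimal, self-contained sketch of an AutoRT-style loop: a VLM grounds the scene, an LLM proposes candidate instructions, a safety filter prunes them, and a surviving task is executed (autonomously or via teleoperation) and logged. All function names and canned outputs here are hypothetical placeholders for illustration, not the paper's actual API.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Episode:
    instruction: str
    mode: str                      # "autonomous" or "teleop"
    observations: list = field(default_factory=list)


def describe_scene(image) -> str:
    # Placeholder for a VLM call that grounds objects visible to the robot.
    return "a table with a sponge, a soda can, and a bag of chips"


def propose_instructions(scene: str, num_tasks: int = 5) -> list[str]:
    # Placeholder for an LLM call that proposes diverse, scene-grounded tasks.
    return [f"pick up the sponge and move it next to the soda can (variant {i})"
            for i in range(num_tasks)]


def passes_safety_filter(instruction: str) -> bool:
    # Placeholder for an LLM-based check against safety/feasibility rules.
    return "knife" not in instruction


def collect_episode(instruction: str) -> Episode:
    # Placeholder for executing the task with an autonomous policy or a teleoperator.
    mode = random.choice(["autonomous", "teleop"])
    return Episode(instruction=instruction, mode=mode)


def run_once(camera_image=None) -> Episode | None:
    """One propose-filter-collect cycle for a single robot."""
    scene = describe_scene(camera_image)
    candidates = propose_instructions(scene)
    valid = [t for t in candidates if passes_safety_filter(t)]
    if not valid:
        return None
    return collect_episode(random.choice(valid))


if __name__ == "__main__":
    print(run_once())
```

In a fleet setting, one would run this loop concurrently across many robots and buildings, with the collected episodes pooled into a shared dataset for robot learning.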