We present an embodied AI system that receives open-ended natural language instructions from a human and controls two arms to collaboratively accomplish potentially long-horizon tasks over a large workspace. Our system is modular: it deploys state-of-the-art large language models for task planning, vision-language models for semantic perception, and point-cloud transformers for grasping. With semantic and physical safety in mind, these modules are interfaced with a real-time trajectory optimizer and a compliant tracking controller to enable human-robot proximity. We demonstrate performance on bi-arm sorting, bottle opening, and trash disposal tasks. All tasks are performed zero-shot: the models have not been trained on any real-world data from this bi-arm robot, its scenes, or its workspace. Composing learning- and non-learning-based components in a modular fashion, with interpretable inputs and outputs, allows the user to easily debug points of failure and fragility. One may also swap modules in place to improve the robustness of the overall platform, for instance with imitation-learned policies. https://sites.google.com/corp/view/safe-robots
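The modular composition described above, with interpretable inputs and outputs between stages and in-place swappable modules, can be illustrated with a minimal sketch. All names here (`Pipeline`, `toy_planner`, `toy_perceiver`, `toy_grasper`) are hypothetical stand-ins, not the paper's actual interfaces or models:

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical sketch: each stage exposes an interpretable interface, so any
# stage can be swapped in place (e.g. replacing a scripted module with an
# imitation-learned policy) and intermediate outputs can be inspected.


@dataclass
class Pipeline:
    plan: Callable[[str], List[str]]   # task planner: instruction -> subtasks
    perceive: Callable[[str], str]     # semantic perception: subtask -> target
    grasp: Callable[[str], str]        # grasp module: target -> command

    def run(self, instruction: str) -> List[str]:
        commands = []
        for subtask in self.plan(instruction):
            target = self.perceive(subtask)   # interpretable intermediate
            commands.append(self.grasp(target))
        return commands


# Toy stand-ins for the learned modules (placeholders, not real models).
def toy_planner(instruction: str) -> List[str]:
    return [f"pick up the {w}" for w in instruction.split(" and ")]


def toy_perceiver(subtask: str) -> str:
    return subtask.rsplit(" ", 1)[-1]


def toy_grasper(obj: str) -> str:
    return f"grasp({obj})"


pipeline = Pipeline(plan=toy_planner, perceive=toy_perceiver, grasp=toy_grasper)
print(pipeline.run("bottle and can"))  # → ['grasp(bottle)', 'grasp(can)']
```

Because each field of `Pipeline` is just a callable with a declared signature, swapping one module (say, replacing `toy_grasper` with a learned policy wrapped in the same signature) leaves the rest of the system untouched, which is the debugging and robustness benefit the abstract claims.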