Enabling robots to perform novel manipulation tasks from natural language instructions remains a fundamental challenge in robotics, despite significant progress in general problem solving with foundation models. Large vision-language models (VLMs) can process high-dimensional input for visual scene and language understanding, and can decompose tasks into a sequence of logical steps; however, they struggle to ground those steps in embodied robot motion. Robotics foundation models, on the other hand, output action commands directly, but require in-domain fine-tuning or experience before they can perform novel tasks successfully. At its core, the challenge remains connecting abstract task reasoning with low-level motion control. To address this disconnect, we propose Language Movement Primitives (LMPs), a framework that grounds VLM reasoning in Dynamic Movement Primitive (DMP) parameterization. Our key insight is that DMPs expose a small number of interpretable parameters, and VLMs can set these parameters to specify diverse, continuous, and stable trajectories. Put another way, VLMs can reason over free-form natural language task descriptions and semantically ground their desired motions into DMPs, bridging the gap between high-level task reasoning and low-level position and velocity control. Building on this combination of VLMs and DMPs, we formulate an LMP pipeline for zero-shot robot manipulation that completes tabletop manipulation problems by generating a sequence of DMP motions. Across 20 real-world manipulation tasks, LMPs achieve 80% task success, compared to 31% for the best-performing baseline. See videos at our website: https://collab.me.vt.edu/lmp
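The "small number of interpretable parameters" that a VLM would set can be illustrated with a standard discrete DMP. The following is a minimal 1-D sketch, not the paper's implementation: the function name, the basis-function construction, and all gain values are illustrative assumptions; a VLM in the described pipeline would supply parameters such as the goal `g`, duration scaling `tau`, and forcing-term weights.

```python
import numpy as np

def rollout_dmp(x0, g, weights, tau=1.0, alpha_s=4.0, K=100.0, dt=0.01, T=1.0):
    """Integrate a 1-D discrete Dynamic Movement Primitive.

    x0, g   : start and goal positions (parameters a VLM could set)
    weights : Gaussian-basis weights of the forcing term (shape the motion)
    tau     : temporal scaling; alpha_s: canonical-system decay rate
    Returns the position trajectory as a NumPy array.
    """
    D = 2.0 * np.sqrt(K)  # critical damping for a stable spring-damper
    n = len(weights)
    centers = np.exp(-alpha_s * np.linspace(0.0, 1.0, n))  # basis centers in phase s
    widths = n ** 1.5 / centers                            # heuristic basis widths
    x, v, s = x0, 0.0, 1.0
    traj = [x]
    for _ in range(int(T / dt)):
        psi = np.exp(-widths * (s - centers) ** 2)
        # Forcing term, scaled by phase and movement amplitude; vanishes as s -> 0,
        # so the trajectory always converges to the goal g.
        f = s * (g - x0) * (psi @ weights) / (psi.sum() + 1e-10)
        v += dt / tau * (K * (g - x) - D * v + f)  # transformation system
        x += dt / tau * v
        s += dt / tau * (-alpha_s * s)             # canonical system
        traj.append(x)
    return np.array(traj)
```

With all weights zero the DMP reduces to a critically damped spring-damper that converges smoothly from `x0` to `g`; nonzero weights reshape the path while preserving that guaranteed convergence, which is what makes the parameterization both expressive and stable.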