KITE: Keypoint-Conditioned Policies for Semantic Manipulation

While natural language offers a convenient shared interface for humans and robots, enabling robots to interpret and follow language commands remains a longstanding challenge in manipulation. A crucial step toward a performant instruction-following robot is achieving semantic manipulation, where a robot interprets language at different levels of specificity, from high-level instructions like "Pick up the stuffed animal" to more detailed inputs like "Grab the left ear of the elephant." To tackle this, we propose Keypoints + Instructions to Execution (KITE), a two-step framework for semantic manipulation which attends to both scene semantics (distinguishing between different objects in a visual scene) and object semantics (precisely localizing different parts within an object instance). KITE first grounds an input instruction in a visual scene through 2D image keypoints, providing a highly accurate object-centric bias for downstream action inference. Given an RGB-D scene observation, KITE then executes a learned keypoint-conditioned skill to carry out the instruction. The combined precision of keypoints and parameterized skills enables fine-grained manipulation with generalization to scene and object variations. Empirically, we demonstrate KITE in 3 real-world environments: long-horizon 6-DoF tabletop manipulation, semantic grasping, and a high-precision coffee-making task. In these settings, KITE achieves overall instruction-following success rates of 75%, 70%, and 71%, respectively. KITE outperforms frameworks that opt for pre-trained visual language models over keypoint-based grounding, or omit skills in favor of end-to-end visuomotor control, all while being trained on fewer or a comparable number of demonstrations. Supplementary material, datasets, code, and videos can be found on our website: http://tinyurl.com/kite-site.
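To make the two-step structure concrete, the following is a minimal sketch, not the KITE implementation. It assumes a pinhole camera model, a depth image indexed as depth[row][col] in meters, and hypothetical names throughout (Intrinsics, deproject_keypoint, follow_instruction, and the ground and skills arguments are illustrative). The assumption that grounding also returns a skill label is ours; in KITE both the grounding module and the skills are learned, whereas here they are plain callables supplied by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (assumed camera model)."""
    fx: float
    fy: float
    cx: float
    cy: float


def deproject_keypoint(u, v, depth_m, K):
    """Back-project pixel (u, v) with depth in meters to camera-frame (x, y, z)."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    x = (u - K.cx) * depth_m / K.fx
    y = (v - K.cy) * depth_m / K.fy
    return (x, y, depth_m)


def follow_instruction(instruction, rgb, depth, K, ground, skills):
    """Two-step sketch: ground the instruction to a keypoint, then run a skill.

    ground(instruction, rgb) -> (skill_name, (u, v))  # hypothetical interface
    skills[skill_name](point_3d) -> result            # hypothetical interface
    """
    # Step 1: grounding maps language + image to a 2D keypoint (and, by
    # assumption here, a skill label).
    skill_name, (u, v) = ground(instruction, rgb)
    # Step 2: lift the keypoint to 3D using the depth channel, then hand it
    # to the keypoint-conditioned skill.
    point_3d = deproject_keypoint(u, v, depth[v][u], K)
    return skills[skill_name](point_3d)
```

With stub callables for ground and skills, follow_instruction can be exercised end to end without a robot or a trained model, which is the only purpose of this sketch.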
