KITE: Keypoint-Conditioned Policies for Semantic Manipulation

While natural language offers a convenient shared interface for humans and robots, enabling robots to interpret and follow language commands remains a longstanding challenge in manipulation. A crucial step toward realizing a performant instruction-following robot is achieving semantic manipulation, where a robot interprets language at different specificities, from high-level instructions like "Pick up the stuffed animal" to more detailed inputs like "Grab the left ear of the elephant." To tackle this, we propose Keypoints + Instructions to Execution (KITE), a two-step framework for semantic manipulation that attends to both scene semantics (distinguishing between different objects in a visual scene) and object semantics (precisely localizing different parts within an object instance). KITE first grounds an input instruction in a visual scene through 2D image keypoints, providing a highly accurate object-centric bias for downstream action inference. Given an RGB-D scene observation, KITE then executes a learned keypoint-conditioned skill to carry out the instruction. The combined precision of keypoints and parameterized skills enables fine-grained manipulation with generalization to scene and object variations. Empirically, we demonstrate KITE in 3 real-world environments: long-horizon 6-DoF tabletop manipulation, semantic grasping, and a high-precision coffee-making task. In these settings, KITE achieves overall instruction-following success rates of 75%, 70%, and 71%, respectively. KITE outperforms frameworks that opt for pre-trained vision-language models over keypoint-based grounding, or omit skills in favor of end-to-end visuomotor control, all while being trained from fewer or comparable amounts of demonstrations. Supplementary material, datasets, code, and videos can be found on our website: http://tinyurl.com/kite-site.
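
To make the link between a 2D image keypoint and an RGB-D observation concrete, the sketch below lifts a pixel keypoint to a 3D point in the camera frame using the standard pinhole camera model. This is an illustration only: the abstract does not say how KITE consumes keypoints, and the names `Intrinsics` and `deproject_keypoint`, as well as the intrinsic values in the usage note, are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical container, not from the paper)."""
    fx: float  # focal length along x
    fy: float  # focal length along y
    cx: float  # principal point, x coordinate
    cy: float  # principal point, y coordinate


def deproject_keypoint(u: float, v: float, depth_m: float,
                       k: Intrinsics) -> Tuple[float, float, float]:
    """Lift pixel (u, v) with depth in meters to a 3D point in the camera frame.

    Uses the standard pinhole relations x = (u - cx) * z / fx and
    y = (v - cy) * z / fy. This is an assumed preprocessing step,
    not a description of KITE's actual pipeline.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive; the pixel may have no valid depth reading")
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)
```

For example, with assumed intrinsics `Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)`, a keypoint at the principal point `(320, 240)` with a depth of 0.5 m maps to a point on the optical axis, 0.5 m in front of the camera. A parameterized skill could then take such a 3D point as its target.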
