KITE: Keypoint-Conditioned Policies for Semantic Manipulation

While natural language offers a convenient shared interface for humans and robots, enabling robots to interpret and follow language commands remains a longstanding challenge in manipulation. A crucial step to realizing a performant instruction-following robot is achieving semantic manipulation, where a robot interprets language at different specificities, from high-level instructions like "Pick up the stuffed animal" to more detailed inputs like "Grab the left ear of the elephant." To tackle this, we propose Keypoints + Instructions to Execution (KITE), a two-step framework for semantic manipulation which attends to both scene semantics (distinguishing between different objects in a visual scene) and object semantics (precisely localizing different parts within an object instance). KITE first grounds an input instruction in a visual scene through 2D image keypoints, providing a highly accurate object-centric bias for downstream action inference. Provided an RGB-D scene observation, KITE then executes a learned keypoint-conditioned skill to carry out the instruction. The combined precision of keypoints and parameterized skills enables fine-grained manipulation with generalization to scene and object variations. Empirically, we demonstrate KITE in 3 real-world environments: long-horizon 6-DoF tabletop manipulation, semantic grasping, and a high-precision coffee-making task. In these settings, KITE achieves a 75%, 70%, and 71% overall success rate for instruction-following, respectively. KITE outperforms frameworks that opt for pre-trained visual language models over keypoint-based grounding, or omit skills in favor of end-to-end visuomotor control, all while being trained from fewer or comparable amounts of demonstrations. Supplementary material, datasets, code, and videos can be found on our website: http://tinyurl.com/kite-site.
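The abstract says that KITE grounds an instruction as a 2D image keypoint and then acts on an RGB-D observation, but it does not state how the keypoint is combined with depth. One common way to relate the two, shown here only as a minimal sketch and not as the authors' implementation, is to deproject the keypoint pixel into a 3D point in the camera frame using a pinhole camera model. The function names, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the toy depth image below are all hypothetical.

```python
from typing import List, Tuple


def deproject_keypoint(
    u: float, v: float, depth_m: float,
    fx: float, fy: float, cx: float, cy: float,
) -> Tuple[float, float, float]:
    """Map pixel (u, v) with metric depth to a 3D point in the camera frame.

    Assumes an ideal pinhole camera with no lens distortion (an assumption,
    not something stated in the abstract).
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive; the pixel may be a depth hole")
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def keypoint_to_point(
    keypoint: Tuple[int, int],
    depth_image: List[List[float]],
    intrinsics: Tuple[float, float, float, float],
) -> Tuple[float, float, float]:
    """Look up depth at an integer keypoint (u, v) and deproject it.

    depth_image is indexed as depth_image[row][col], i.e. [v][u].
    """
    u, v = keypoint
    fx, fy, cx, cy = intrinsics
    depth_m = depth_image[v][u]
    return deproject_keypoint(u, v, depth_m, fx, fy, cx, cy)


if __name__ == "__main__":
    # Hypothetical 4x4 depth image with every pixel 0.5 m from the camera.
    depth = [[0.5] * 4 for _ in range(4)]
    # Hypothetical intrinsics: focal lengths of 2 px, principal point at (2, 2).
    print(keypoint_to_point((3, 2), depth, (2.0, 2.0, 2.0, 2.0)))
```

A keypoint-conditioned skill could then take the resulting 3D point as the position parameter of a grasp or placement, with orientation supplied separately. That division of labor is likewise an assumption of this sketch.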
