KITE: Keypoint-Conditioned Policies for Semantic Manipulation

While natural language offers a convenient shared interface for humans and robots, enabling robots to interpret and follow language commands remains a longstanding challenge in manipulation. A crucial step toward realizing a performant instruction-following robot is achieving semantic manipulation, where a robot interprets language at different specificities, from high-level instructions like "Pick up the stuffed animal" to more detailed inputs like "Grab the left ear of the elephant." To tackle this, we propose Keypoints + Instructions to Execution (KITE), a two-step framework for semantic manipulation which attends to both scene semantics (distinguishing between different objects in a visual scene) and object semantics (precisely localizing different parts within an object instance). KITE first grounds an input instruction in a visual scene through 2D image keypoints, providing a highly accurate object-centric bias for downstream action inference. Provided an RGB-D scene observation, KITE then executes a learned keypoint-conditioned skill to carry out the instruction. The combined precision of keypoints and parameterized skills enables fine-grained manipulation with generalization to scene and object variations. Empirically, we demonstrate KITE in three real-world environments: long-horizon 6-DoF tabletop manipulation, semantic grasping, and a high-precision coffee-making task. In these settings, KITE achieves overall instruction-following success rates of 75%, 70%, and 71%, respectively. KITE outperforms frameworks that opt for pre-trained visual language models over keypoint-based grounding, or omit skills in favor of end-to-end visuomotor control, all while being trained from fewer or comparable amounts of demonstrations. Supplementary material, datasets, code, and videos can be found on our website: http://tinyurl.com/kite-site.
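
The abstract does not specify how a grounded 2D keypoint is related to the RGB-D observation. One common way to connect the two, shown here purely as an illustrative sketch and not as KITE's actual implementation, is to deproject the keypoint pixel into a 3D point in the camera frame using a pinhole camera model. The names `Intrinsics` and `deproject_keypoint`, and the intrinsic values in the usage lines, are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics (hypothetical container, values in pixels)."""
    fx: float  # focal length along x
    fy: float  # focal length along y
    cx: float  # principal point, x
    cy: float  # principal point, y


def deproject_keypoint(
    u: float, v: float, depth_m: float, K: Intrinsics
) -> Tuple[float, float, float]:
    """Lift a 2D image keypoint (u, v) to a 3D point in the camera frame.

    Assumes an ideal pinhole model with no lens distortion, and a depth
    reading (in meters) taken at the keypoint's pixel location.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive; the reading may be invalid")
    x = (u - K.cx) * depth_m / K.fx
    y = (v - K.cy) * depth_m / K.fy
    return (x, y, depth_m)


# Usage with made-up intrinsics: a keypoint at the principal point
# lies on the optical axis, so only its depth is nonzero.
K = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
point = deproject_keypoint(320.0, 240.0, 0.5, K)
```

A 3D point obtained this way could then serve as one input to a parameterized skill; how KITE's learned skills actually consume the keypoint is described in the paper itself, not in this sketch.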
