We present DEF-oriCORN, a framework for language-directed manipulation tasks. By leveraging a novel object-based scene representation and a diffusion-model-based state estimation algorithm, our framework enables efficient and robust manipulation planning in response to verbal commands, even in tightly packed environments observed from sparse camera views, without requiring any demonstrations. Unlike traditional representations, ours affords efficient collision checking and language grounding. Compared to state-of-the-art baselines, our framework achieves superior estimation and motion planning performance from sparse RGB images and generalizes zero-shot to real-world scenarios with diverse materials, including transparent and reflective objects, despite being trained exclusively in simulation. Our code for data generation, training, and inference, along with pre-trained weights, is publicly available at: https://sites.google.com/view/def-oricorn/home.