Scene representation has been a crucial design choice in robotic manipulation systems. An ideal representation should be 3D, dynamic, and semantic to meet the demands of diverse manipulation tasks. However, previous works rarely satisfy all three of these properties simultaneously. In this work, we introduce D$^3$Fields - dynamic 3D descriptor fields. These fields capture the dynamics of the underlying 3D environment and encode both semantic features and instance masks. Specifically, we project arbitrary 3D points in the workspace onto multi-view 2D visual observations and interpolate features derived from foundation models. The resulting fused descriptor fields allow for flexible goal specifications using 2D images with varied contexts, styles, and instances. To evaluate the effectiveness of these descriptor fields, we apply our representation to a wide range of robotic manipulation tasks in a zero-shot manner. Through extensive evaluation in both real-world scenarios and simulations, we demonstrate that D$^3$Fields are both generalizable and effective for zero-shot robotic manipulation tasks. In quantitative comparisons with state-of-the-art dense descriptors, such as Dense Object Nets and DINO, D$^3$Fields exhibit significantly better generalization abilities and manipulation accuracy.
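The core construction described above, projecting arbitrary 3D workspace points into each camera view, bilinearly interpolating per-view feature maps, and fusing the results into a single descriptor per point, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the simple pinhole projection, and the plain mean fusion across views (the actual method may weight views, e.g. by visibility or depth) are all assumptions.

```python
import numpy as np

def project_points(points, K, w2c):
    """Project Nx3 world-frame points into one camera's pixel coordinates.
    K is the 3x3 intrinsics; w2c is the 4x4 world-to-camera extrinsics."""
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    cam = (w2c @ pts_h.T).T[:, :3]                          # world -> camera frame
    uv = (K @ cam.T).T                                      # camera -> image plane
    return uv[:, :2] / uv[:, 2:3], cam[:, 2]                # pixel coords, depth

def bilinear_sample(feat_map, uv):
    """Bilinearly interpolate an HxWxC feature map at Nx2 pixel coordinates."""
    H, W, _ = feat_map.shape
    u = np.clip(uv[:, 0], 0, W - 1 - 1e-6)
    v = np.clip(uv[:, 1], 0, H - 1 - 1e-6)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    return (feat_map[v0, u0] * (1 - du) * (1 - dv)
            + feat_map[v0, u0 + 1] * du * (1 - dv)
            + feat_map[v0 + 1, u0] * (1 - du) * dv
            + feat_map[v0 + 1, u0 + 1] * du * dv)

def fuse_descriptors(points, views):
    """Fuse per-view interpolated features into one descriptor per 3D point.
    `views` is a list of (feat_map, K, w2c) tuples, where feat_map would be
    produced by a 2D foundation model (e.g. a DINO feature map); a simple
    mean over views stands in for the paper's fusion step (an assumption)."""
    per_view = []
    for feat_map, K, w2c in views:
        uv, _depth = project_points(points, K, w2c)
        per_view.append(bilinear_sample(feat_map, uv))
    return np.mean(per_view, axis=0)  # Nx C descriptor field values
```

In this sketch, querying the descriptor field at any 3D point reduces to one projection and one interpolation per camera, which is why the representation supports arbitrary query points in the workspace rather than a fixed voxel grid.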