DTF-Net: Category-Level Pose Estimation and Shape Reconstruction via Deformable Template Field

Estimating 6D poses and reconstructing 3D shapes of objects in open-world scenes from RGB-D image pairs is challenging. Many existing methods learn geometric features that correspond to specific templates and disregard the shape variations and pose differences among objects in the same category. As a result, they underperform on unseen object instances in complex environments. Other approaches achieve category-level estimation and reconstruction by leveraging normalized geometric structure priors, but reconstruction based on a static prior struggles with substantial intra-class variation. To address these problems, we propose DTF-Net, a novel framework for pose estimation and shape reconstruction based on implicit neural fields of object categories. In DTF-Net, we design a deformable template field that represents both the general category-wise shape latent features and the intra-category geometric deformation features. The field establishes continuous shape correspondences, deforming the category template into arbitrary observed instances to accomplish shape reconstruction. We introduce a pose regression module that shares the deformation features and template codes from the field to accurately estimate the 6D pose of each object in the scene. We also integrate a multi-modal representation extraction module that extracts object features and semantic masks, enabling end-to-end inference. Moreover, during training, we apply a shape-invariant training strategy and a viewpoint sampling method to further enhance the model's ability to extract object pose features. Extensive experiments on the REAL275 and CAMERA25 datasets demonstrate the superiority of DTF-Net in both synthetic and real scenes. Furthermore, we show that DTF-Net effectively supports grasping tasks with a real robot arm.
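To make the idea of a deformable template field concrete, the following is a minimal NumPy sketch, not the DTF-Net architecture. It assumes an analytic unit-sphere signed distance function in place of the learned category template, and a hypothetical 3-vector of per-axis scales in place of the learned deformation code. All function names are illustrative. The sketch shows how a deformation field warps instance-space points into template space, how the inverse warp gives template-to-instance correspondences, and how a rigid pose places the reconstructed shape in the camera frame.

```python
import numpy as np


def template_sdf(points):
    """Signed distance to a unit sphere (stand-in for a learned category template field)."""
    return np.linalg.norm(points, axis=-1) - 1.0


def deformation_offset(points, code):
    """Offset that warps instance-space points into template space.

    `code` is a hypothetical 3-vector of per-axis scales (assumption);
    a learned model would predict offsets from a latent code instead.
    """
    return points / code - points


def instance_field(points, code):
    """Instance field: evaluate the template at the warped query points.

    Its zero level set is the instance surface; away from the surface the
    values are not true distances in this toy warp.
    """
    return template_sdf(points + deformation_offset(points, code))


def template_to_instance(template_points, code):
    """Inverse warp: carries template points to their corresponding instance points."""
    return template_points * code


def to_camera_frame(points, rotation, translation):
    """Apply a rigid pose (3x3 rotation, 3-vector translation) to canonical points."""
    return points @ rotation.T + translation


# Instance code: an ellipsoid with semi-axes 2.0, 1.0, 0.5 (illustrative values).
code = np.array([2.0, 1.0, 0.5])

# Three points on the template surface and their instance correspondences.
template_surface = np.eye(3)
instance_surface = template_to_instance(template_surface, code)

# Corresponding points should lie on the zero level set of the instance field.
residual = instance_field(instance_surface, code)

# Place the instance in the camera frame: 90 degree rotation about z, shift along z.
rotation = np.array([[0.0, -1.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0]])
translation = np.array([0.0, 0.0, 1.0])
observed = to_camera_frame(instance_surface, rotation, translation)
```

In this toy setting the warp is known in closed form, so the correspondences are exact. In a learned setting, both the template field and the deformation would be neural networks conditioned on latent codes, and the pose would be regressed rather than given.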
