Humans can easily deduce the relative pose of an unseen object, without labels or training, given only a single query-reference image pair. This is arguably achieved by incorporating (i) 3D/2.5D shape perception from a single image, (ii) render-and-compare simulation, and (iii) rich semantic cue awareness to furnish (coarse) reference-query correspondence. Existing methods implement (i) with a 3D CAD model or well-calibrated multiple images and (ii) by training a network on specific objects, both of which necessitate laborious ground-truth labeling and tedious training, potentially leading to challenges in generalization. Moreover, (iii) was less exploited in the paradigm of (ii), even though the coarse correspondence from (iii) enhances the comparison step by filtering out non-overlapping parts under substantial pose differences/occlusions. Motivated by this, we propose a novel generalizable 3D relative pose estimation method that realizes (i) with a 2.5D shape from an RGB-D reference, (ii) with an off-the-shelf differentiable renderer, and (iii) with semantic cues from a pretrained model like DINOv2. Specifically, our differentiable renderer takes the 2.5D rotatable mesh textured by the RGB image and the semantic maps (obtained by DINOv2 from the RGB input), then renders new RGB and semantic maps (with back-surface culling) under a novel rotated view. The refinement loss comes from comparing the rendered RGB and semantic maps with the query ones, back-propagating the gradients through the differentiable renderer to refine the 3D relative pose. As a result, our method can be readily applied to unseen objects, given only a single RGB-D reference, without labels or training. Extensive experiments on LineMOD, LM-O, and YCB-V show that our training-free method significantly outperforms the SOTA supervised methods, especially under the rigorous Acc@5°/10°/15° metrics and the challenging cross-dataset settings.
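To make the refinement idea concrete, the toy sketch below optimizes an axis-angle relative rotation by gradient descent on a compare loss. It is a minimal stand-in, not the paper's pipeline: the 2.5D textured mesh and differentiable renderer are replaced by a rotated point cloud with known correspondence, and analytic back-propagation is replaced by central finite differences. All names (`rotmat`, `refine_pose`, the toy data) are illustrative assumptions.

```python
import numpy as np

def rotmat(w):
    """Rodrigues' formula: axis-angle vector w (3,) -> 3x3 rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def compare_loss(w, ref_pts, qry_pts):
    """Toy 'render-and-compare' loss: rotate the reference and measure the
    point-wise L2 error against the query (stands in for comparing rendered
    RGB/semantic maps with the query maps)."""
    return np.mean(np.sum((ref_pts @ rotmat(w).T - qry_pts) ** 2, axis=1))

def refine_pose(ref_pts, qry_pts, iters=500, lr=1.0, eps=1e-5):
    """Gradient descent on the axis-angle pose; finite differences stand in
    for back-propagating through a differentiable renderer."""
    w = np.zeros(3)
    for _ in range(iters):
        g = np.zeros(3)
        for j in range(3):
            d = np.zeros(3)
            d[j] = eps
            g[j] = (compare_loss(w + d, ref_pts, qry_pts)
                    - compare_loss(w - d, ref_pts, qry_pts)) / (2.0 * eps)
        w -= lr * g
    return w

rng = np.random.default_rng(0)
ref_pts = rng.uniform(-0.5, 0.5, size=(64, 3))   # toy "2.5D shape" as points
w_gt = np.array([0.10, 0.30, -0.20])             # ground-truth relative pose
qry_pts = ref_pts @ rotmat(w_gt).T               # simulated query view

w_est = refine_pose(ref_pts, qry_pts)
R_delta = rotmat(w_est).T @ rotmat(w_gt)
ang_err = np.degrees(np.arccos(np.clip((np.trace(R_delta) - 1.0) / 2.0, -1.0, 1.0)))
```

In the actual method, `compare_loss` would render RGB and DINOv2 semantic maps from the rotated textured mesh and compare them to the query images, so the semantic channels help discount non-overlapping regions; the optimization loop itself is the same gradient-descent refinement shown here.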