Category-level 6D pose estimation aims to predict the poses and sizes of unseen objects from a specific category. Thanks to prior deformation, which explicitly adapts a category-specific 3D prior (i.e., a 3D template) to a given object instance, prior-based methods have attained great success and become a major research stream. However, obtaining category-specific priors requires collecting a large number of 3D models, which is labor-intensive and often infeasible in practice. This motivates us to investigate whether priors are necessary to make prior-based methods effective. Our empirical study shows that the 3D prior itself is not what drives the high performance. The key factor is actually the explicit deformation process, which aligns camera and world coordinates under the supervision of 3D models in world space (also called canonical space). Inspired by these observations, we introduce a simple prior-free implicit space transformation network, namely IST-Net, which transforms camera-space features to their world-space counterparts and builds correspondences between them implicitly, without relying on 3D priors. In addition, we design camera-space and world-space enhancers to enrich the features with pose-sensitive information and geometric constraints, respectively. Although simple, IST-Net achieves state-of-the-art performance with a prior-free design, along with the top inference speed on the REAL275 benchmark. Our code and models are available at https://github.com/CVMI-Lab/IST-Net.
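To make the relation between camera space and world (canonical) space concrete, the following is a minimal sketch of the geometry involved, not of IST-Net itself. It assumes the common formulation in which an object's pose and scale form a similarity transform between the two spaces, and it uses the standard Umeyama alignment to recover that transform from point correspondences. The function names `to_camera_space` and `estimate_similarity` are hypothetical and do not come from the released code.

```python
import numpy as np


def to_camera_space(points_world, scale, rotation, translation):
    """Map canonical (world-space) points of shape (N, 3) into camera space."""
    return scale * points_world @ rotation.T + translation


def estimate_similarity(points_world, points_cam):
    """Umeyama alignment (assumed here for illustration).

    Recovers (scale, rotation, translation) such that
    points_cam ~= scale * points_world @ rotation.T + translation.
    """
    n = len(points_world)
    mu_w = points_world.mean(axis=0)
    mu_c = points_cam.mean(axis=0)
    xw = points_world - mu_w
    xc = points_cam - mu_c

    # 3x3 cross-covariance between camera-space and world-space points.
    cov = xc.T @ xw / n
    u, d, vt = np.linalg.svd(cov)

    # Flip the last axis if needed so the result is a proper rotation.
    sign = np.ones(3)
    if np.linalg.det(u) * np.linalg.det(vt) < 0:
        sign[-1] = -1.0

    rotation = u @ np.diag(sign) @ vt
    var_w = (xw ** 2).sum() / n
    scale = (d * sign).sum() / var_w
    translation = mu_c - scale * rotation @ mu_w
    return scale, rotation, translation


def random_rotation(rng):
    """Draw a random proper rotation matrix via QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1.0
    return q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts_world = rng.uniform(-0.5, 0.5, size=(200, 3))
    R, s, t = random_rotation(rng), 0.3, np.array([0.1, -0.2, 1.5])
    pts_cam = to_camera_space(pts_world, s, R, t)
    s_hat, R_hat, t_hat = estimate_similarity(pts_world, pts_cam)
    print("scale error:", abs(s_hat - s))
    print("rotation error:", np.abs(R_hat - R).max())
    print("translation error:", np.abs(t_hat - t).max())
```

In this simplified setting the correspondences are given exactly. The difficulty the abstract addresses is obtaining the world-space counterpart of camera-space observations in the first place, which prior-based methods do through explicit prior deformation and IST-Net does implicitly at the feature level.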