We address the task of estimating 6D camera poses from sparse-view image sets (2-8 images). This task is a vital pre-processing stage for nearly all contemporary (neural) reconstruction algorithms but remains challenging given sparse views, especially for objects with visual symmetries and texture-less surfaces. We build on the recent RelPose framework, which learns a network that infers distributions over relative rotations for image pairs. We extend this approach in two key ways. First, we use attentional transformer layers to process multiple images jointly, since additional views of an object may resolve ambiguous symmetries in any given image pair (such as the handle of a mug that becomes visible in a third view). Second, we augment this network to also report camera translations by defining an appropriate coordinate system that decouples the ambiguity in rotation estimation from translation prediction. Our final system yields large improvements in 6D pose prediction over prior art on both seen and unseen object categories, and also enables pose estimation and 3D reconstruction for in-the-wild objects.