Many automation tasks that involve manipulating rigid objects require the poses of those objects to be acquired. Vision-based pose estimation using a single RGB or RGB-D sensor is especially popular due to its broad applicability. However, single-view pose estimation is inherently limited by depth ambiguity and by ambiguities caused by phenomena such as occlusion, self-occlusion, and reflections. Aggregating information from multiple views can potentially resolve these ambiguities, but the current state-of-the-art multi-view pose estimation method uses multiple views only to aggregate single-view pose estimates, and thus relies on obtaining good single-view estimates. We present a multi-view pose estimation method which aggregates learned 2D-3D distributions from multiple views for both the initial estimate and optional refinement. Our method performs probabilistic sampling of 3D-3D correspondences under epipolar constraints using learned 2D-3D correspondence distributions which are implicitly trained to respect visual ambiguities such as symmetry. Evaluation on the T-LESS dataset shows that our method reduces pose estimation errors by 80-91% compared to the best single-view method, and we present state-of-the-art results on T-LESS with four views, even compared with methods using five and eight views.
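The role of the epipolar constraint can be illustrated with a minimal sketch. This is not the authors' implementation: it only shows how candidate 2D matches in a second calibrated view could be reweighted by their distance to the epipolar line of a point sampled in the first view. The function names, the Gaussian weighting, and the `sigma` value are illustrative assumptions, and the candidate probabilities stand in for the learned correspondence distributions, which are not reproduced here.

```python
import math
import random


def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) applied to v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(a, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def project(X):
    """Pinhole projection to normalized homogeneous image coordinates (u, v, 1)."""
    return [X[0] / X[2], X[1] / X[2], 1.0]


def essential_matrix(R, t):
    """E = [t]_x R for the relative pose X2 = R X1 + t (normalized coordinates)."""
    return matmul(skew(t), R)


def epipolar_distance(E, x1, x2):
    """Distance from x2 (view 2) to the epipolar line E x1 induced by x1 (view 1)."""
    a, b, c = matvec(E, x1)
    return abs(a * x2[0] + b * x2[1] + c * x2[2]) / math.hypot(a, b)


def epipolar_weights(E, x1, candidates, probs, sigma=0.01):
    """Reweight view-2 candidate probabilities by a Gaussian on epipolar distance.

    The Gaussian form and sigma are assumptions made for this sketch.
    """
    w = [p * math.exp(-0.5 * (epipolar_distance(E, x1, x2) / sigma) ** 2)
         for x2, p in zip(candidates, probs)]
    total = sum(w)
    if total == 0.0:
        raise ValueError("no candidate is consistent with the epipolar line")
    return [wi / total for wi in w]


def sample_match(E, x1, candidates, probs, rng=random):
    """Draw one view-2 candidate according to the epipolar-reweighted probabilities."""
    weights = epipolar_weights(E, x1, candidates, probs)
    return rng.choices(candidates, weights=weights, k=1)[0]


# Usage: camera 2 is related to camera 1 by a pure translation along x.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [-0.2, 0.0, 0.0]
E = essential_matrix(R, t)

X1 = [0.1, 0.05, 1.0]                       # 3D point in the camera-1 frame
X2 = [a + b for a, b in zip(matvec(R, X1), t)]
x1, x2_true = project(X1), project(X2)
x2_decoy = [-0.1, 0.15, 1.0]                # lies off the epipolar line

print(epipolar_weights(E, x1, [x2_true, x2_decoy], [0.5, 0.5]))
print(sample_match(E, x1, [x2_true, x2_decoy], [0.5, 0.5]))
```

With equal prior probabilities, the candidate on the epipolar line receives almost all of the weight, so geometrically inconsistent pairs are rarely sampled. A full pipeline would additionally triangulate the sampled pairs and fit a pose to the resulting 3D-3D correspondences, which this sketch omits.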