Semi-dense detector-free (SDF) approaches, such as LoFTR, are currently among the most popular image matching methods. Although SDF methods are trained to establish correspondences between two images, their performance is almost exclusively evaluated using relative pose estimation metrics. As a result, the link between their ability to establish correspondences and the quality of the resulting estimated pose has received little attention. This paper is a first attempt to study this link. We start by proposing a novel structured attention-based image matching architecture (SAM). It allows us to show a counter-intuitive result on two datasets (MegaDepth and HPatches): on the one hand, SAM either outperforms or is on par with SDF methods in terms of pose/homography estimation metrics; on the other hand, SDF approaches are significantly better than SAM in terms of matching accuracy. We then propose to limit the computation of the matching accuracy to textured regions, and show that in this case SAM often surpasses SDF methods. Our findings highlight a strong correlation between the ability to establish accurate correspondences in textured regions and the accuracy of the resulting estimated pose/homography. Our code will be made available.
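To make the idea of restricting matching accuracy to textured regions concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. The abstract does not say how textured regions are defined, so the gradient-magnitude threshold used here is an assumption, as are the function names `texture_mask` and `matching_accuracy`, the parameter `grad_thresh`, and the 3-pixel error threshold `px_thresh`.

```python
import numpy as np


def texture_mask(gray, grad_thresh=0.1):
    """Boolean mask of 'textured' pixels in a grayscale image.

    Assumption: a pixel counts as textured when its gradient magnitude
    exceeds grad_thresh. The paper may use a different criterion.
    """
    gy, gx = np.gradient(np.asarray(gray, dtype=np.float64))
    return np.hypot(gx, gy) > grad_thresh


def matching_accuracy(src_pts, pred_pts, gt_pts, px_thresh=3.0, mask=None):
    """Fraction of correspondences whose error is at most px_thresh pixels.

    src_pts:  (N, 2) array of (x, y) locations in the source image.
    pred_pts: (N, 2) predicted locations in the target image.
    gt_pts:   (N, 2) ground-truth locations in the target image.
    mask:     optional (H, W) boolean mask over the source image; when
              given, only correspondences whose source point falls on a
              True pixel are counted.
    Returns NaN when no correspondence is selected.
    """
    src_pts = np.asarray(src_pts, dtype=np.float64)
    pred_pts = np.asarray(pred_pts, dtype=np.float64)
    gt_pts = np.asarray(gt_pts, dtype=np.float64)

    # Per-correspondence Euclidean error in the target image.
    err = np.linalg.norm(pred_pts - gt_pts, axis=1)

    keep = np.ones(len(src_pts), dtype=bool)
    if mask is not None:
        # Round source coordinates to the nearest pixel and clamp to the image.
        cols = np.clip(np.round(src_pts[:, 0]).astype(int), 0, mask.shape[1] - 1)
        rows = np.clip(np.round(src_pts[:, 1]).astype(int), 0, mask.shape[0] - 1)
        keep = mask[rows, cols]

    if not keep.any():
        return float("nan")
    return float((err[keep] <= px_thresh).mean())
```

Calling `matching_accuracy` once without `mask` and once with `mask=texture_mask(image)` gives the two quantities the abstract contrasts: accuracy over all correspondences and accuracy over textured regions only.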