Semi-dense detector-free approaches (SDF), such as LoFTR, are currently among the most popular image matching methods. Although SDF methods are trained to establish correspondences between two images, their performance is almost exclusively evaluated using relative pose estimation metrics. The link between their ability to establish correspondences and the quality of the resulting estimated pose has therefore received little attention so far. This paper is a first attempt to study this link. We start by proposing a novel structured attention-based image matching architecture (SAM). It allows us to show a counter-intuitive result on two datasets (MegaDepth and HPatches): on the one hand, SAM either outperforms or is on par with SDF methods in terms of pose/homography estimation metrics; on the other hand, SDF approaches are significantly better than SAM in terms of matching accuracy. We then propose to limit the computation of the matching accuracy to textured regions, and show that in this case SAM often surpasses SDF methods. Our findings highlight a strong correlation between the ability to establish accurate correspondences in textured regions and the accuracy of the resulting estimated pose/homography. Our code will be made available.
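To make the distinction between overall matching accuracy and matching accuracy restricted to textured regions concrete, the following is a minimal sketch. It is not the paper's evaluation code. It assumes that matching accuracy is the fraction of correspondences whose endpoint error falls below a pixel threshold, and that a point counts as "textured" when the image gradient magnitude at that point exceeds a threshold. Both definitions, the function names `matching_accuracy` and `textured_mask`, and the default thresholds are illustrative assumptions.

```python
import numpy as np


def matching_accuracy(pred, gt, threshold=3.0, mask=None):
    """Fraction of correspondences with endpoint error below `threshold` pixels.

    `pred` and `gt` are (N, 2) arrays of predicted and ground-truth target
    locations. If `mask` is given, only the selected correspondences count.
    (Assumed definition, for illustration only.)
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    err = np.linalg.norm(pred - gt, axis=1)
    if mask is not None:
        err = err[np.asarray(mask, dtype=bool)]
    if err.size == 0:
        return float("nan")  # no correspondence left to evaluate
    return float(np.mean(err < threshold))


def textured_mask(image, points, grad_threshold=10.0):
    """Mark source points (x, y) whose gradient magnitude exceeds the threshold.

    Hypothetical texture criterion: a simple per-pixel gradient magnitude test
    on a grayscale image.
    """
    gy, gx = np.gradient(np.asarray(image, dtype=float))
    magnitude = np.hypot(gx, gy)
    pts = np.rint(np.asarray(points)).astype(int)
    xs = np.clip(pts[:, 0], 0, magnitude.shape[1] - 1)
    ys = np.clip(pts[:, 1], 0, magnitude.shape[0] - 1)
    return magnitude[ys, xs] > grad_threshold


if __name__ == "__main__":
    # Toy image: flat left half, flat right half, one vertical edge between them.
    image = np.zeros((8, 8))
    image[:, 4:] = 100.0

    source = np.array([[3, 2], [0, 2], [4, 5], [7, 7]])
    gt = source.astype(float)  # identity mapping as toy ground truth
    pred = gt + np.array([[0.5, 0.0], [10.0, 0.0], [0.0, 1.0], [0.0, 10.0]])

    mask = textured_mask(image, source)
    print("overall accuracy: ", matching_accuracy(pred, gt))
    print("textured accuracy:", matching_accuracy(pred, gt, mask=mask))
```

In this toy setup the two points lying on the edge are matched accurately while the two points in flat regions are not, so the restricted metric is higher than the overall one. The same mechanism can let the ranking of two methods differ between the two metrics.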