Correspondence pruning aims to find correct matches (inliers) from an initial set of putative correspondences, a fundamental task for many applications. Pruning is challenging because inlier ratios vary across scenes and image pairs due to significant visual differences. Moreover, the performance of existing methods is often limited by a lack of visual cues (\eg texture, illumination, structure) of the scenes. In this paper, we propose a Visual-Spatial Fusion Transformer (VSFormer) to identify inliers and recover camera poses accurately. First, we obtain highly abstract visual cues of a scene via cross attention between the local features of two-view images. We then model these visual cues and correspondences with a joint visual-spatial fusion module, simultaneously embedding the visual cues into correspondences for pruning. Additionally, to mine the consistency of correspondences, we design a novel module that combines a KNN-based graph with a transformer, effectively capturing both local and global contexts. Extensive experiments demonstrate that VSFormer outperforms state-of-the-art methods on both outdoor and indoor benchmarks.
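To make the two architectural ideas above concrete, the following is a minimal PyTorch sketch of (1) cross attention between the local features of the two views to distill visual cues, and (2) a KNN-graph block followed by self-attention over correspondence embeddings to capture local and global context. All module names, dimensions, and hyperparameters here are illustrative assumptions for exposition, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionCues(nn.Module):
    """Visual cues: features of view A attend to features of view B."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_a, feat_b):
        # feat_a: (B, Na, C) local features of image A; feat_b: (B, Nb, C)
        cues, _ = self.attn(query=feat_a, key=feat_b, value=feat_b)
        return cues  # abstract visual cues of the scene

class KNNGraphBlock(nn.Module):
    """Local context via a KNN graph over correspondence embeddings,
    then global context via transformer-style self-attention."""
    def __init__(self, dim=128, k=8, heads=4):
        super().__init__()
        self.k = k
        self.local_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, N, C) embeddings of N putative correspondences (N > k)
        B, N, C = x.shape
        # Local: aggregate each node's k nearest neighbors in feature space.
        dist = torch.cdist(x, x)                                      # (B, N, N)
        idx = dist.topk(self.k + 1, largest=False).indices[..., 1:]  # drop self
        nbrs = torch.gather(
            x.unsqueeze(1).expand(B, N, N, C), 2,
            idx.unsqueeze(-1).expand(B, N, self.k, C))                # (B, N, k, C)
        edge = torch.cat(
            [x.unsqueeze(2).expand_as(nbrs), nbrs - x.unsqueeze(2)], dim=-1)
        local = self.local_mlp(edge).max(dim=2).values                # (B, N, C)
        # Global: full self-attention over all correspondences.
        global_ctx, _ = self.global_attn(local, local, local)
        return local + global_ctx
```

In such a design, the cues from `CrossAttentionCues` would be fused into the correspondence embeddings (\eg by projection and addition) before the `KNNGraphBlock`, so that pruning decisions can draw on both spatial consistency and scene appearance.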