FMRT: Learning Accurate Feature Matching with Reconciliatory Transformer

Local feature matching, an essential component of several computer vision tasks (e.g., structure from motion and visual localization), has been effectively addressed by Transformer-based methods. However, these methods integrate long-range context information among keypoints only with a fixed receptive field, which prevents the network from reconciling the importance of features with different receptive fields to achieve complete image perception and hence limits matching accuracy. In addition, these methods use a conventional handcrafted encoding approach to integrate the positional information of keypoints into the visual descriptors, which limits the network's ability to extract reliable positional encoding information. In this study, we propose Feature Matching with Reconciliatory Transformer (FMRT), a novel Transformer-based detector-free method that adaptively reconciles features with multiple receptive fields and uses parallel networks to realize reliable positional encoding. Specifically, FMRT introduces a dedicated Reconciliatory Transformer (RecFormer) that consists of a Global Perception Attention Layer (GPAL) to extract visual descriptors with different receptive fields and integrate global context information at various scales, a Perception Weight Layer (PWL) to adaptively measure the importance of the various receptive fields, and a Local Perception Feed-forward Network (LPFFN) to extract a deep, aggregated multi-scale local feature representation. Extensive experiments demonstrate that FMRT yields strong performance on multiple benchmarks, including pose estimation, visual localization, homography estimation, and image matching.
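
The idea of adaptively weighting features from several receptive fields can be illustrated with a minimal sketch. The abstract does not specify how the PWL computes its weights, so the softmax gating below is an assumed, generic stand-in and not the authors' layer; the function names and the `scoring_weights` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_receptive_fields(features, scoring_weights):
    """Fuse one descriptor per receptive field into a single descriptor.

    features: list of K descriptors (each a list of D floats), one per
        receptive field, all describing the same keypoint.
    scoring_weights: list of D floats; a hypothetical stand-in for
        parameters that a real network would learn.
    """
    # Assumption: each receptive field is scored by a simple dot product.
    scores = [
        sum(f_d * w_d for f_d, w_d in zip(f, scoring_weights))
        for f in features
    ]
    # Normalize the scores so the importance weights sum to 1.
    weights = softmax(scores)
    # The fused descriptor is the importance-weighted sum over receptive fields.
    dim = len(features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, fused


# Toy example: three receptive fields, two-dimensional descriptors.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, fused = fuse_receptive_fields(features, [0.5, 0.5])
```

In this toy input the third descriptor receives the highest score, so it contributes most to the fused result. In a trained model the scoring parameters would be learned jointly with the attention layers rather than fixed by hand.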
