Semantic-aware Representation Learning for Homography Estimation

Homography estimation is the task of determining the transformation between the two images of an image pair. Our approach employs detector-free feature matching to address this problem. Previous work has underscored the importance of incorporating semantic information, but an efficient way to exploit it is still lacking. Existing methods treat semantics as a pre-processing step, which makes their use of semantics overly coarse-grained and poorly adaptable to different tasks. In this work, we pursue a different way of using semantic information: a semantic-aware feature representation learning framework. Building on it, we propose SRMatcher, a new detector-free feature matching method that encourages the network to learn feature representations with integrated semantics. Specifically, to capture precise and rich semantics, we leverage recently popularized vision foundation models (VFMs) trained on extensive datasets. We then propose a cross-image Semantic-aware Fusion Block (SFB) that integrates their fine-grained semantic features into the feature representation space. By reducing the errors that stem from semantic inconsistencies between matched pairs, SRMatcher delivers more accurate and realistic results. Extensive experiments show that SRMatcher surpasses solid baselines and attains state-of-the-art (SOTA) results on multiple real-world datasets. Compared to the previous SOTA approach, GeoFormer, SRMatcher increases the area under the cumulative curve (AUC) by about 11% on HPatches. In addition, SRMatcher can serve as a plug-and-play framework for other matching methods such as LoFTR, yielding substantial precision improvements.
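To make the downstream task concrete, the sketch below shows how a homography can be recovered once a matcher has produced point correspondences. It is not part of SRMatcher: it is a plain direct linear transform (DLT) in NumPy, and the function names `estimate_homography` and `apply_homography` are hypothetical. It assumes at least four clean matches in general position and omits the coordinate normalization and robust outlier rejection (for example RANSAC) that a practical pipeline would normally add.

```python
import numpy as np


def estimate_homography(src, dst):
    """Estimate a 3x3 homography H with dst ~ H @ src via plain DLT.

    src, dst: (N, 2) arrays of matched points, N >= 4 (assumed outlier-free).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each match contributes two linear constraints on the 9 entries of H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.array(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    # Fix the scale ambiguity (assumes H[2, 2] is not close to zero).
    return H / H[2, 2]


def apply_homography(H, pts):
    """Map (N, 2) points through H using homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]
```

Given the matches output by any feature matcher, `estimate_homography(src, dst)` returns the estimated transformation, and `apply_homography` can warp reference points (such as the image corners) to measure how far the estimate deviates from a ground-truth homography.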
