The image matching field has witnessed a continuous emergence of novel learnable feature matching techniques, with ever-improving performance on conventional benchmarks. However, our investigation shows that despite these gains, their potential for real-world applications is restricted by their limited generalization capabilities to novel image domains. In this paper, we introduce OmniGlue, the first learnable image matcher designed with generalization as a core principle. OmniGlue leverages broad knowledge from a vision foundation model to guide the feature matching process, boosting generalization to domains not seen at training time. Additionally, we propose a novel keypoint position-guided attention mechanism that disentangles spatial and appearance information, leading to enhanced matching descriptors. We perform comprehensive experiments on a suite of $7$ datasets with varied image domains, including scene-level, object-centric, and aerial images. OmniGlue's novel components lead to relative gains on unseen domains of $20.9\%$ with respect to a directly comparable reference model, while also outperforming the recent LightGlue method by $9.5\%$ relative. Code and model can be found at https://hwjiang1510.github.io/OmniGlue.
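To give intuition for the keypoint position-guided attention described above, here is a minimal NumPy sketch, not the paper's implementation: it assumes that keypoint positional encodings are used to guide the attention weights (via queries/keys) but are excluded from the aggregated values, so the refined descriptors remain appearance-only. All function and variable names below are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def position_guided_attention(desc, pos_enc):
    """Self-attention where keypoint positions guide *where* to attend
    (through queries/keys) but are kept out of the aggregated values,
    disentangling spatial and appearance information."""
    qk = desc + pos_enc                                  # position-aware queries/keys
    attn = softmax(qk @ qk.T / np.sqrt(desc.shape[-1]))  # attention weights
    return attn @ desc                                   # values exclude positions

rng = np.random.default_rng(0)
desc = rng.standard_normal((5, 64))  # 5 keypoint appearance descriptors
pos = rng.standard_normal((5, 64))   # their positional encodings
out = position_guided_attention(desc, pos)
print(out.shape)  # (5, 64)
```

Because the positional encoding only shapes the attention map and never enters the value aggregation, the output descriptors stay free of spatial bias, which is one plausible way such disentanglement could aid cross-domain generalization.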