We revisit the problem of training attention-based sparse image matching models for various local features. We first identify a critical design choice that has been previously overlooked and that significantly impacts the performance of the LightGlue model. We then investigate the roles of detectors and descriptors within the transformer-based matching framework, finding that detectors, rather than descriptors, are often the primary cause of performance differences. Finally, we propose a novel approach to fine-tune existing image matching models using keypoints from a diverse set of detectors, resulting in a universal, detector-agnostic model. When deployed as a zero-shot matcher for novel detectors, the resulting model matches or exceeds the accuracy of models specifically trained for those features. Our findings offer valuable insights for the deployment of transformer-based matching models and the future design of local features.