Finding localized correspondences across different images of the same object is crucial to understanding its geometry. In recent years, this problem has seen remarkable progress with the advent of deep learning-based local image features and learnable matchers. Still, learnable matchers often underperform when there exist only small regions of co-visibility between image pairs (i.e., wide camera baselines). To address this problem, we leverage recent progress in coarse single-view geometry estimation methods. We propose LFM-3D, a Learnable Feature Matching framework that uses models based on graph neural networks and enhances their capabilities by integrating noisy, estimated 3D signals to boost correspondence estimation. When integrating 3D signals into the matcher model, we show that a suitable positional encoding is critical to effectively make use of the low-dimensional 3D information. We experiment with two different 3D signals - normalized object coordinates and monocular depth estimates - and evaluate our method on large-scale (synthetic and real) datasets containing object-centric image pairs across wide baselines. We observe strong feature matching improvements compared to 2D-only methods, with up to +6% total recall and +28% precision at fixed recall. Additionally, we demonstrate that the resulting improved correspondences lead to much higher relative posing accuracy for in-the-wild image pairs - an improvement of up to 8.6% over the 2D-only approach.
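To illustrate why a positional encoding can help with low-dimensional 3D inputs, the following is a minimal sketch that lifts a 3D signal into a higher-dimensional feature vector before it is combined with a keypoint's 2D location. The abstract does not state which encoding LFM-3D uses, so the sinusoidal (Fourier-feature) form, the function names `fourier_encode` and `augment_keypoint`, the `num_bands` parameter, and the sample values are all assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def fourier_encode(point: Sequence[float], num_bands: int = 4) -> List[float]:
    """Map each coordinate to sin/cos features at octave-spaced frequencies.

    Assumed encoding for illustration; not taken from the paper.
    """
    features: List[float] = []
    for value in point:
        for band in range(num_bands):
            freq = (2.0 ** band) * math.pi  # frequencies pi, 2*pi, 4*pi, ...
            features.append(math.sin(freq * value))
            features.append(math.cos(freq * value))
    return features


def augment_keypoint(
    xy: Sequence[float], xyz: Sequence[float], num_bands: int = 4
) -> List[float]:
    """Concatenate a 2D keypoint location with the encoded 3D signal."""
    return list(xy) + fourier_encode(xyz, num_bands)


# Hypothetical keypoint: a normalized image location plus an estimated
# normalized object coordinate in [0, 1]^3.
encoded = augment_keypoint((0.25, 0.75), (0.5, 0.0, 1.0), num_bands=4)
print(len(encoded))
```

With three coordinates and four frequency bands, the 3D signal expands from 3 values to 24, giving a 26-dimensional input once the 2D location is prepended. A monocular depth estimate could be passed through the same function as a one-element sequence.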