Finding localized correspondences across different images of the same object is crucial to understanding its geometry. In recent years, this problem has seen remarkable progress with the advent of deep-learning-based local image features and learnable matchers. Still, learnable matchers often underperform when only small regions of co-visibility exist between image pairs (i.e., wide camera baselines). To address this problem, we leverage recent progress in coarse single-view geometry estimation methods. We propose LFM-3D, a Learnable Feature Matching framework that builds on graph-neural-network matchers and enhances their capabilities by integrating noisy, estimated 3D signals to boost correspondence estimation. When integrating 3D signals into the matcher model, we show that a suitable positional encoding is critical to making effective use of the low-dimensional 3D information. We experiment with two different 3D signals, normalized object coordinates and monocular depth estimates, and evaluate our method on large-scale (synthetic and real) datasets containing object-centric image pairs across wide baselines. We observe strong feature matching improvements compared to 2D-only methods, with up to +6% total recall and +28% precision at fixed recall. We additionally demonstrate that the resulting improved correspondences lead to much higher relative posing accuracy for in-the-wild image pairs, with a more than 8% boost compared to the 2D-only approach.
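To illustrate why a positional encoding can matter for low-dimensional 3D inputs, the following is a minimal sketch of one common choice, a sinusoidal (Fourier-feature) encoding that lifts each 3D coordinate into a higher-dimensional vector before it is combined with a keypoint descriptor. The abstract does not state which encoding LFM-3D uses, so the function name `positional_encoding`, the parameter `num_freqs`, the frequency schedule, and the example inputs are all assumptions made for illustration only.

```python
import numpy as np


def positional_encoding(points, num_freqs=10):
    """Map (N, 3) coordinates to (N, 3 * 2 * num_freqs) sinusoidal features.

    Hypothetical helper: the frequency schedule (powers of two times pi)
    is an assumption, not taken from the paper.
    """
    points = np.asarray(points, dtype=np.float64)
    # Frequencies pi * 2^0, pi * 2^1, ..., pi * 2^(num_freqs - 1).
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    # Broadcast (N, 3, 1) against (num_freqs,) to get (N, 3, num_freqs).
    angles = points[..., None] * freqs
    # Per coordinate: all sines followed by all cosines -> (N, 3, 2 * num_freqs).
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    # Flatten the per-coordinate features into one vector per point.
    return enc.reshape(points.shape[0], -1)


# Made-up example: 4 keypoints with estimated 3D coordinates in [-1, 1].
rng = np.random.default_rng(0)
xyz = rng.uniform(-1.0, 1.0, size=(4, 3))
encoded = positional_encoding(xyz, num_freqs=10)
print(encoded.shape)  # (4, 60)
```

In a matcher, such an encoded vector would typically be passed through a small learned projection and fused with the 2D keypoint features; that fusion step is omitted here because the abstract does not describe it.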