Extracting point correspondences from two or more views of a scene is a fundamental computer vision problem with particular importance for relative camera pose estimation and structure-from-motion. Existing local feature matching approaches, trained with correspondence supervision on large-scale datasets, obtain highly-accurate matches on the test sets. However, they do not generalise well to new datasets with different characteristics to those they were trained on, unlike classic feature extractors. Instead, they require finetuning, which assumes that ground-truth correspondences or ground-truth camera poses and 3D structure are available. We relax this assumption by removing the requirement of 3D structure, e.g., depth maps or point clouds, and only require camera pose information, which can be obtained from odometry. We do so by replacing correspondence losses with epipolar losses, which encourage putative matches to lie on the associated epipolar line. While weaker than correspondence supervision, we observe that this cue is sufficient for finetuning existing models on new data. We then further relax the assumption of known camera poses by using pose estimates in a novel bootstrapping approach. We evaluate on highly challenging datasets, including an indoor drone dataset and an outdoor smartphone camera dataset, and obtain state-of-the-art results without strong supervision.
翻译:从场景的两个或多个视图中提取点对应关系是计算机视觉中的基本问题,对相对相机位姿估计和运动恢复结构尤为关键。现有基于大规模数据集进行对应监督训练的局部特征匹配方法,在测试集上可获得高度精确的匹配结果。然而与经典特征提取器不同,这些方法难以泛化到具有不同特征的训练集外新数据集。它们需要微调,这要求具备真实对应关系或真实相机位姿与三维结构信息。我们通过移除三维结构(如深度图或点云)需求、仅需可从里程计获取的相机位姿信息来放宽这一假设。具体而言,我们将对应损失替换为对极损失,促使候选匹配点落在对应极线上。虽然对极监督弱于对应监督,但我们观察到该线索足以在新型数据上微调现有模型。进一步地,我们通过新颖的自举方法使用位姿估计值来放宽已知相机位姿的假设。我们在极具挑战性的数据集(包括室内无人机数据集和室外智能手机相机数据集)上进行评估,取得了无需强监督的最优结果。