In this paper, we propose an end-to-end framework that jointly learns keypoint detection, descriptor representation, and cross-frame matching for image-based 3D localization. Prior art has tackled each of these components individually, purportedly to alleviate the difficulty of effectively training a holistic network. We design a self-supervised image-warping correspondence loss for both feature detection and matching, a weakly supervised epipolar-constraint loss for relative camera pose learning, and a directional matching scheme that detects keypoint features in a source image and performs a coarse-to-fine correspondence search on the target image. We leverage this framework to enforce cycle consistency in our matching module. In addition, we propose a new loss that robustly handles both definite inlier/outlier matches and less-certain matches. Integrating these learning mechanisms enables end-to-end training of a single network that performs all three localization components. Benchmarking our approach on public datasets shows that such an end-to-end framework yields more accurate localization, outperforming both traditional methods and state-of-the-art weakly supervised methods.
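To make the weakly supervised epipolar-constraint loss more concrete, the following is a minimal sketch of one common formulation, not necessarily the one used in this work. It assumes that a relative pose (R, t) between the two frames is available as supervision, that keypoints are given in normalized camera coordinates, and that the penalty is the point-to-epipolar-line distance averaged over matches. The function names `essential_from_pose`, `epipolar_distance`, and `epipolar_loss` are hypothetical and chosen only for illustration.

```python
import math

def skew(t):
    """Return the 3x3 cross-product matrix [t]_x, so that [t]_x v = t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]

def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(a, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def essential_from_pose(R, t):
    """Essential matrix E = [t]_x R for the convention X_target = R X_source + t."""
    return matmul(skew(t), R)

def epipolar_distance(E, x_src, x_tgt):
    """Distance from the target point to the epipolar line E x_src.

    Both points are (x, y) pairs in normalized camera coordinates
    (an assumption of this sketch).
    """
    p_src = [x_src[0], x_src[1], 1.0]
    p_tgt = [x_tgt[0], x_tgt[1], 1.0]
    line = matvec(E, p_src)  # epipolar line (a, b, c) in the target image
    numerator = abs(sum(p_tgt[i] * line[i] for i in range(3)))
    denominator = math.hypot(line[0], line[1])
    return numerator / max(denominator, 1e-12)  # guard against a degenerate line

def epipolar_loss(E, matches):
    """Mean epipolar distance over a list of (x_src, x_tgt) matches."""
    return sum(epipolar_distance(E, s, g) for s, g in matches) / len(matches)

# Illustrative pose: no rotation, unit translation along the x-axis.
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
E_example = essential_from_pose(R_identity, (1.0, 0.0, 0.0))

# The 3D point (0.5, 0.2, 2.0) projects to (0.25, 0.1) in the source frame
# and to (0.75, 0.1) in the target frame, so this pair is geometrically consistent.
consistent_pair = ((0.25, 0.1), (0.75, 0.1))
# Shifting the target point vertically moves it off the epipolar line.
inconsistent_pair = ((0.25, 0.1), (0.75, 0.3))
example_loss = epipolar_loss(E_example, [consistent_pair, inconsistent_pair])
```

In this sketch a geometrically consistent match contributes no penalty, while a match that lies off its epipolar line is penalized in proportion to its offset. The supervision is weak in the sense that only the relative pose is required, not ground-truth correspondences.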