We present three multi-scale similarity learning architectures, collectively called DeepSim-Nets. These models learn pixel-level matching with a contrastive loss and are agnostic to the geometry of the scene. We establish a middle ground between hybrid and end-to-end approaches by learning to match all corresponding pixels of an epipolar pair densely and at once. Features are learned on large image tiles so that they are expressive and capture the wider context of the scene. We also demonstrate that curated sample mining makes the predicted similarities more robust and improves performance in radiometrically homogeneous areas. We run experiments on aerial and satellite datasets. DeepSim-Nets outperform the baseline hybrid approaches and generalize better to unseen scene geometries than end-to-end methods. The architecture is flexible and can be readily adopted in standard multi-resolution image matching pipelines.
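To make the idea of dense, pixel-level contrastive matching on an epipolar pair concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes per-pixel feature maps of shape (H, W, C) for a rectified pair, cosine similarity as the similarity measure, a hinge-style contrastive loss, and the convention x_right = x_left - disparity. The function names, the `margin` and `neg_offset` parameters, and the fixed-offset choice of negatives are all hypothetical; the curated sample mining described in the abstract is not reproduced here.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale feature vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def row_similarity(feat_left, feat_right):
    """Cosine similarity between every left and right pixel on the same row.

    feat_left, feat_right: arrays of shape (H, W, C) from a rectified
    (epipolar) pair. Returns sim of shape (H, W, W) with
    sim[y, x_left, x_right].
    """
    fl = l2_normalize(feat_left)
    fr = l2_normalize(feat_right)
    return np.einsum("ywc,yvc->ywv", fl, fr)


def contrastive_hinge_loss(sim, disparity, valid, margin=0.3, neg_offset=4):
    """Hinge contrastive loss over all valid pixels at once.

    Assumed convention: the match of left pixel (y, x) lies at (y, x - d)
    in the right image. The negative sample is taken neg_offset pixels
    away from the true match on the same row (a simple stand-in for a
    proper mining strategy).
    """
    _, width, _ = sim.shape
    ys, xs = np.nonzero(valid)
    x_pos = xs - disparity[ys, xs]
    x_neg = x_pos + neg_offset
    # Keep only pixels whose positive and negative both fall inside the image.
    keep = (x_pos >= 0) & (x_pos < width) & (x_neg >= 0) & (x_neg < width)
    ys, xs, x_pos, x_neg = ys[keep], xs[keep], x_pos[keep], x_neg[keep]
    if ys.size == 0:
        return 0.0
    s_pos = sim[ys, xs, x_pos]
    s_neg = sim[ys, xs, x_neg]
    # Penalize negatives that score within `margin` of the true match.
    return float(np.mean(np.maximum(0.0, margin + s_neg - s_pos)))
```

In this sketch the similarity volume is computed for every pixel of the pair in a single call, which mirrors the "all corresponding pixels at once" idea; a learned network would supply `feat_left` and `feat_right`, and the loss would be minimized with respect to its weights.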