The assumption of a static environment is common in many geometric computer vision tasks such as SLAM but limits their applicability in highly dynamic scenes. Since these tasks rely on identifying point correspondences between input images within the static part of the environment, we propose a graph neural network-based sparse feature matching network designed to perform robust matching under challenging conditions while excluding keypoints on moving objects. Like state-of-the-art feature matching networks, we enhance keypoint representations through attentional aggregation over graph edges, but we augment the graph with epipolar and temporal information and vastly reduce the number of edges. Furthermore, we introduce a self-supervised training scheme that extracts pseudo-labels for image pairs in dynamic environments from unprocessed visual-inertial data alone. A series of experiments shows that our network outperforms state-of-the-art feature matching networks at excluding keypoints on moving objects while achieving comparable results on conventional matching metrics. When integrated into a SLAM system, our network significantly improves performance, especially in highly dynamic scenes.
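The epipolar information mentioned above hinges on a standard two-view constraint: under camera motion alone, a correspondence (x1, x2) between images of a static scene satisfies x2ᵀ F x1 = 0 for the fundamental matrix F, while keypoints on independently moving objects generally violate it. The abstract does not specify how this is exploited, but a minimal sketch of the underlying check, using the first-order (Sampson) epipolar error, could look as follows; the function name and the thresholding idea are illustrative assumptions, not the paper's actual pipeline:

```python
import numpy as np

def sampson_distances(F, pts1, pts2):
    """First-order geometric error of the epipolar constraint x2' F x1 = 0.

    F:    (3,3) fundamental (or essential, for calibrated points) matrix
    pts1: (N,2) keypoints in image 1, pts2: (N,2) matched keypoints in image 2
    """
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])  # homogeneous coords, (N,3)
    x2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    Fx1 = x1 @ F.T    # rows are F @ x1_i  (epipolar lines in image 2)
    Ftx2 = x2 @ F     # rows are F.T @ x2_i (epipolar lines in image 1)
    num = np.sum(x2 * Fx1, axis=1) ** 2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / den

# Essential matrix [t]_x for a camera translating along x, t = (1, 0, 0):
# static correspondences keep the same y coordinate (in normalized coords).
F = np.array([[0., 0.,  0.],
              [0., 0., -1.],
              [0., 1.,  0.]])
static  = sampson_distances(F, np.array([[0.1, 0.2]]), np.array([[0.3, 0.2]]))
dynamic = sampson_distances(F, np.array([[0.1, 0.2]]), np.array([[0.3, 0.8]]))
# static[0] is ~0; dynamic[0] is clearly nonzero, flagging a moving point.
```

In a self-supervised setting of the kind the abstract describes, the camera motion (and hence F) could come from the visual-inertial data, and matches with a large epipolar error on an otherwise well-registered pair would be natural candidates for "moving object" pseudo-labels; a fixed threshold on the Sampson distance is the simplest such rule.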