The assumption of a static environment is common in many geometric computer vision tasks such as SLAM but limits their applicability in highly dynamic scenes. Since these tasks rely on identifying point correspondences between input images within the static part of the environment, we propose a graph neural network-based sparse feature matching network that performs robust matching under challenging conditions while excluding keypoints on moving objects. Like state-of-the-art feature matching networks, we employ attentional aggregation over graph edges to enhance keypoint representations, but we augment the graph with epipolar and temporal information and vastly reduce the number of graph edges. Furthermore, we introduce a self-supervised training scheme that extracts pseudo-labels for image pairs in dynamic environments exclusively from unprocessed visual-inertial data. A series of experiments shows that our network excludes keypoints on moving objects more reliably than state-of-the-art feature matching networks while achieving comparable results on conventional matching metrics. When integrated into a SLAM system, our network significantly improves performance, especially in highly dynamic scenes.
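To make the core mechanism mentioned above concrete, the following is a minimal sketch of single-head attentional aggregation over a sparse edge list, where each keypoint descriptor is refined by a softmax-weighted sum of its graph neighbors. This is an illustrative simplification, not the paper's implementation: the function name, the single-head dot-product formulation, and the residual update are all assumptions for demonstration.

```python
import numpy as np

def attention_aggregate(desc, edges):
    """Hypothetical single-head attentional aggregation over a sparse edge list.

    desc  : (N, D) array of keypoint descriptors (node features)
    edges : list of (src, dst) pairs; messages flow src -> dst
    Returns an updated (N, D) array (residual attentional update per node).
    """
    n, d = desc.shape
    out = desc.copy()
    for i in range(n):
        # Neighbors sending messages to node i along graph edges.
        nbrs = [s for (s, t) in edges if t == i]
        if not nbrs:
            continue  # isolated node: descriptor left unchanged
        q = desc[i]                      # query: the receiving keypoint
        keys = desc[nbrs]                # keys/values: neighbor descriptors
        logits = keys @ q / np.sqrt(d)   # scaled dot-product attention scores
        w = np.exp(logits - logits.max())
        w /= w.sum()                     # softmax over incoming edges
        out[i] = desc[i] + w @ keys      # residual aggregation of messages
    return out
```

Restricting `edges` to a small, informed subset (e.g. selected via epipolar or temporal cues, as the abstract describes) is what reduces the cost relative to the fully connected attention graphs used by prior matching networks.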