Cross-view self-localization is a challenging scenario of visual place recognition in which database images are available only from sparse viewpoints. Recently, approaches that synthesize database images from unseen viewpoints using NeRF (Neural Radiance Fields) have emerged with impressive performance. However, the synthesized images are often of lower quality than the original images, and they significantly increase the storage cost of the database. In this study, we explore a new hybrid scene model that combines the advantages of view-invariant appearance features computed from raw images and view-dependent spatial-semantic features computed from synthesized images. These two types of features are fused into scene graphs, which are compressively learned and recognized by a graph neural network. The effectiveness of the proposed method is verified on a novel cross-view self-localization dataset with many unseen views generated using the photorealistic Habitat simulator.
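As a rough illustration of the fusion step described in the abstract, the sketch below attaches two feature vectors to each scene-graph node, concatenates them, and applies one round of mean neighbour aggregation before pooling the graph into a single descriptor. It is a minimal sketch under stated assumptions, not the method of this study: the function names, the concatenation-based fusion, the unweighted mean aggregation, and the toy feature values are all hypothetical, and a trained graph neural network would apply learned weights at each step.

```python
def fuse_node_features(appearance, spatial_semantic):
    """Concatenate the two per-node feature vectors (assumed fusion rule)."""
    return [a + s for a, s in zip(appearance, spatial_semantic)]


def mean_aggregate(features, edges):
    """One round of mean aggregation over each node and its neighbours."""
    n = len(features)
    # Every node is its own neighbour (self-loop); edges are undirected.
    neighbours = [{i} for i in range(n)]
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    dim = len(features[0])
    return [
        [sum(features[k][d] for k in nbrs) / len(nbrs) for d in range(dim)]
        for nbrs in neighbours
    ]


def graph_descriptor(features, edges):
    """Mean-pool the aggregated node features into one vector per graph."""
    aggregated = mean_aggregate(features, edges)
    n = len(aggregated)
    return [sum(column) / n for column in zip(*aggregated)]


# Toy scene graph with three nodes (hypothetical values).
appearance = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]]        # from raw images
spatial_semantic = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # from synthesized images
edges = [(0, 1), (1, 2)]

fused = fuse_node_features(appearance, spatial_semantic)
descriptor = graph_descriptor(fused, edges)
print(len(fused[0]), len(descriptor))
```

In this sketch the descriptor has a fixed length regardless of how many nodes the graph contains, which is one simple way a compact per-place representation could be stored instead of the synthesized images themselves.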