We study the problem of aligning a video that captures a local portion of an environment to a 2D LiDAR scan of the entire environment. We introduce a method, VioLA, that first builds a semantic map of the local scene from the image sequence and then extracts points at a fixed height to register against the LiDAR map. Because of reconstruction errors or partial camera coverage, the reconstructed semantic map may not contain enough information for registration. To address this, VioLA uses a pre-trained text-to-image inpainting model paired with a depth completion model to fill in the missing scene content in a geometrically consistent way that supports pose registration. We evaluate VioLA on two real-world RGB-D benchmarks and on a self-captured dataset of a large office scene. Notably, the proposed scene completion module improves pose registration performance by up to 20%.
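The idea of extracting points at a fixed height and registering them to a 2D map can be illustrated with a minimal sketch. It is not VioLA's implementation: the function names `slice_at_height` and `fit_rigid_2d`, the height tolerance `tol`, and the synthetic data are all assumptions made for illustration. The registration step, which the abstract does not specify, is replaced here by a plain least-squares rigid fit (the Kabsch method) that assumes point correspondences are already known.

```python
import numpy as np


def slice_at_height(points_xyz, height, tol=0.05):
    """Keep 3D points whose z lies within `tol` of `height`; return their (x, y).

    `tol` is a hypothetical parameter; the abstract only says "fixed height".
    """
    points_xyz = np.asarray(points_xyz, dtype=float)
    mask = np.abs(points_xyz[:, 2] - height) <= tol
    return points_xyz[mask, :2]


def fit_rigid_2d(src, dst):
    """Least-squares rotation R (2x2) and translation t (2,) with dst ~ src @ R.T + t.

    Kabsch method; assumes src[i] corresponds to dst[i].
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def demo():
    """Synthetic check: slice a random cloud, move it by a known pose, recover the pose."""
    rng = np.random.default_rng(0)
    # Stand-in for a local 3D reconstruction (not real data).
    cloud = rng.uniform([-2.0, -2.0, 0.0], [2.0, 2.0, 2.0], size=(500, 3))
    local_2d = slice_at_height(cloud, height=1.0, tol=0.1)

    theta = np.deg2rad(30.0)
    R_true = np.array([[np.cos(theta), -np.sin(theta)],
                       [np.sin(theta), np.cos(theta)]])
    t_true = np.array([3.0, -1.5])
    # Stand-in for the matching region of the 2D LiDAR map.
    map_2d = local_2d @ R_true.T + t_true

    R_est, t_est = fit_rigid_2d(local_2d, map_2d)
    return R_true, t_true, R_est, t_est


if __name__ == "__main__":
    R_true, t_true, R_est, t_est = demo()
    print("rotation error:", np.abs(R_est - R_true).max())
    print("translation error:", np.abs(t_est - t_true).max())
```

In practice the correspondences between the local slice and the LiDAR map are unknown and the local map may be incomplete, which is the difficulty the scene completion module is meant to address; the closed-form fit above only shows what is being estimated once matches are available.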