Visual localization in complex indoor environments remains a critical challenge for robotics and AR applications. Sequential localization, in which pose estimates are refined over time, is particularly important for autonomous agents. However, traditional methods often require storing extensive image databases or point clouds, which incurs significant storage overhead. This paper introduces a novel, lightweight approach to sequential visual localization using 3D scene graphs. Our method represents the environment as a compact scene graph in which nodes represent objects (with coarse meshes) and edges encode spatial relationships. For each image in the localization phase, we extract per-patch semantic features that predict object identities. Localization is performed within a particle filter framework. Each particle represents a camera pose hypothesis. It projects the coarse object meshes from the scene graph into the image and assigns object identities to patches based on visibility. The similarity between the per-patch features of the input image and the corresponding object features from the scene graph determines the particle's weight. Subsequent images are incorporated sequentially to refine the pose estimate. By leveraging a compact scene graph and efficient semantic matching, our method significantly reduces storage requirements while maintaining localization performance on real-world datasets. The code will be available at https://github.com/DmblnNicole/sg2loc.
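The particle weighting and sequential update described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The sketch makes these assumptions:

- Each object's coarse mesh is approximated by point samples.
- A per-patch z-buffer stands in for mesh rasterization.
- A particle's score is the mean cosine similarity between the image's patch features and the features of the objects it predicts for those patches, divided by a hypothetical `temperature`.
- After systematic resampling, particles are diffused by a random-walk motion model with hypothetical noise parameters (`trans_sigma`, `yaw_sigma`).

All function names are illustrative.

```python
import numpy as np


def rot_y(theta):
    """Rotation about the world y (vertical) axis, used here for yaw-only noise."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def project_points(points_w, R_wc, t_wc, K):
    """Project world points (N, 3) into a camera with camera-to-world pose (R_wc, t_wc)."""
    pts_c = (points_w - t_wc) @ R_wc            # row-wise R_wc^T (X_w - t_wc)
    depth = pts_c[:, 2]
    uvw = pts_c @ K.T                            # pinhole projection, homogeneous
    uv = uvw[:, :2] / np.maximum(depth[:, None], 1e-9)
    return uv, depth


def render_patch_labels(objects, R_wc, t_wc, K, grid_hw, patch_size):
    """Assign each image patch the id of the nearest visible object (-1 = none).

    Assumption: objects are point samples of their coarse meshes, and a
    per-patch z-buffer replaces proper mesh rasterization."""
    H, W = grid_hw
    labels = -np.ones((H, W), dtype=int)
    zbuf = np.full((H, W), np.inf)
    for obj_id, pts in enumerate(objects):
        uv, depth = project_points(pts, R_wc, t_wc, K)
        front = depth > 0                        # drop points behind the camera
        cols = np.floor(uv[front, 0] / patch_size).astype(int)
        rows = np.floor(uv[front, 1] / patch_size).astype(int)
        z = depth[front]
        inside = (rows >= 0) & (rows < H) & (cols >= 0) & (cols < W)
        for r, c, d in zip(rows[inside], cols[inside], z[inside]):
            if d < zbuf[r, c]:                   # keep only the closest object per patch
                zbuf[r, c] = d
                labels[r, c] = obj_id
    return labels


def log_likelihood(patch_feats, labels, obj_feats, temperature=0.1):
    """Mean cosine similarity between image patch features and the predicted
    objects' features, scaled by 1/temperature (hypothetical scoring rule)."""
    valid = labels >= 0
    if not valid.any():
        return -1.0 / temperature                # assumption: worst score if nothing visible
    f = patch_feats[valid]                       # (M, D) features observed in the image
    g = obj_feats[labels[valid]]                 # (M, D) features of predicted objects
    cos = np.sum(f * g, axis=1) / (
        np.linalg.norm(f, axis=1) * np.linalg.norm(g, axis=1) + 1e-9)
    return float(cos.mean()) / temperature


def systematic_resample(weights, rng):
    """Systematic resampling: one uniform offset, n evenly spaced positions."""
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    idx = np.searchsorted(np.cumsum(weights), positions)
    return np.minimum(idx, n - 1)                # guard against round-off in cumsum


def filter_step(particles, patch_feats, objects, obj_feats, K, patch_size, rng,
                trans_sigma=0.05, yaw_sigma=0.02):
    """One sequential update per image: weight, normalize, resample, diffuse."""
    grid_hw = patch_feats.shape[:2]
    log_w = np.array([
        log_likelihood(patch_feats,
                       render_patch_labels(objects, R, t, K, grid_hw, patch_size),
                       obj_feats)
        for R, t in particles])
    w = np.exp(log_w - log_w.max())              # subtract max for numerical stability
    w /= w.sum()
    idx = systematic_resample(w, rng)
    # Random-walk motion model (assumption): Gaussian translation and yaw noise.
    new_particles = [
        (rot_y(rng.normal(0.0, yaw_sigma)) @ particles[i][0],
         particles[i][1] + rng.normal(0.0, trans_sigma, size=3))
        for i in idx]
    return new_particles, w
```

To use it, initialize `particles` as a list of `(R_wc, t_wc)` pairs sampled around a prior. Then call `filter_step` once per incoming image, passing that image's `(H, W, D)` patch-feature array. The highest-weight particle, or a weighted mean of the particles, serves as the current pose estimate.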