We introduce a novel problem: localizing an input image within a multi-modal reference map represented by a database of 3D scene graphs. These graphs comprise multiple modalities, including object-level point clouds, images, attributes, and relationships between objects, and offer a lightweight and efficient alternative to conventional methods that rely on extensive image databases. Given the available modalities, the proposed method, SceneGraphLoc, learns a fixed-size embedding for each node of the scene graph (i.e., each object instance), enabling effective matching with the objects visible in the input query image. This strategy significantly outperforms other cross-modal methods, even without incorporating images into the map embeddings. When images are leveraged, SceneGraphLoc achieves performance close to that of state-of-the-art techniques that depend on large image databases, while requiring three orders of magnitude less storage and operating orders of magnitude faster. The code will be made public.
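To make the matching idea concrete, the following is a minimal sketch of how per-object embeddings from a query image could be compared against per-node embeddings of candidate scene graphs. It is an illustration under stated assumptions, not the SceneGraphLoc implementation: the function names (`cosine`, `scene_score`, `localize`), the scoring rule (mean of the best cosine similarity per query object), and the hand-written 3D toy embeddings are all hypothetical, and the learned encoders that would produce real embeddings are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def scene_score(query_embeddings, node_embeddings):
    """Assumed scoring rule: for each query object, take its best-matching
    node in the scene graph, then average these best similarities."""
    if not query_embeddings or not node_embeddings:
        return 0.0
    best = [max(cosine(q, n) for n in node_embeddings) for q in query_embeddings]
    return sum(best) / len(best)


def localize(query_embeddings, scene_db):
    """Return the id of the scene graph whose nodes best match the query objects."""
    return max(scene_db, key=lambda sid: scene_score(query_embeddings, scene_db[sid]))


# Toy database: scene id -> list of fixed-size node embeddings (hand-made, not learned).
scene_db = {
    "kitchen": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "office": [[0.0, 0.0, 1.0], [0.7, 0.0, 0.7]],
}

# Toy embeddings for two objects visible in the query image.
query = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]

best_scene = localize(query, scene_db)
print(best_scene)
```

In this toy setup the map stores only one short vector per object, which is the intuition behind the storage savings claimed relative to image-database approaches; the actual embedding dimensionality and scoring function are not specified in this abstract.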