VOOM: Robust Visual Object Odometry and Mapping using Hierarchical Landmarks

In recent years, object-oriented simultaneous localization and mapping (SLAM) has attracted increasing attention because it provides high-level semantic information while remaining computationally efficient. Some researchers have tried to improve localization accuracy by integrating modeled object residuals into bundle adjustment. However, few of these methods outperform feature-based visual SLAM systems, because generic coarse object models such as cuboids or ellipsoids are less accurate than feature points. In this paper, we propose VOOM, a Visual Object Odometry and Mapping framework that uses high-level objects and low-level points as hierarchical landmarks in a coarse-to-fine manner, rather than directly using object residuals in bundle adjustment. First, we introduce an improved observation model and a novel data association method for dual quadrics, which we use to represent physical objects. Together they facilitate the creation of a 3D map that closely reflects reality. Next, we use object information to enhance the data association of feature points and update the map accordingly. In the visual object odometry backend, the updated map is used to further optimize the camera pose and the objects. Meanwhile, in our visual object mapping process, local bundle adjustment is performed over covisibility graphs built from both objects and points. Experiments show that VOOM outperforms both object-oriented SLAM systems and feature-point-based SLAM systems such as ORB-SLAM2 in localization accuracy. The implementation of our method is available at https://github.com/yutongwangBIT/VOOM.git.
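
The abstract represents objects as dual quadrics but does not spell out how a quadric is observed in an image. As background only, here is a minimal sketch of the standard projective relation used in quadric-based object SLAM: a dual quadric Q* projects through a 3x4 camera matrix P to a dual conic C* ∝ P Q* P^T, from which the image ellipse and its tight axis-aligned bounding box can be read off. This is the textbook projection, not VOOM's improved observation model or its data association method. The function names (`ellipsoid_dual_quadric`, `project_dual_quadric`, `conic_to_ellipse`) and the pure-Python matrix helpers are hypothetical, chosen to keep the example dependency-free.

```python
import math

# Hypothetical helpers illustrating the standard dual-quadric projection;
# this is NOT VOOM's improved observation model.

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def ellipsoid_dual_quadric(center, semi_axes, R=None):
    """Dual quadric Q* = Z diag(a^2, b^2, c^2, -1) Z^T with Z = [[R, t], [0, 1]]."""
    if R is None:
        R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    Z = [list(R[i]) + [center[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]
    D = [[0.0] * 4 for _ in range(4)]
    for i, s in enumerate(semi_axes):
        D[i][i] = s * s
    D[3][3] = -1.0
    return matmul(matmul(Z, D), transpose(Z))

def project_dual_quadric(P, Q_dual):
    """Dual conic C* = P Q* P^T, defined only up to scale."""
    return matmul(matmul(P, Q_dual), transpose(P))

def conic_to_ellipse(C_dual):
    """Return (center, semi_axes, bbox) of the image ellipse encoded by a dual conic.

    Assumes the conic is a proper ellipse (C*[2][2] != 0, shape matrix positive definite).
    """
    # Normalise so that C*[2][2] = -1; then C* = [[M - c c^T, -c], [-c^T, -1]].
    s = -C_dual[2][2]
    N = [[v / s for v in row] for row in C_dual]
    cx, cy = -N[0][2], -N[1][2]
    # Shape matrix M of the ellipse (x - c)^T M^{-1} (x - c) = 1.
    m00 = N[0][0] + cx * cx
    m01 = N[0][1] + cx * cy
    m11 = N[1][1] + cy * cy
    # Eigenvalues of the symmetric 2x2 matrix M are the squared semi-axes.
    mean = 0.5 * (m00 + m11)
    rad = math.sqrt((0.5 * (m00 - m11)) ** 2 + m01 ** 2)
    semi_axes = (math.sqrt(mean + rad), math.sqrt(mean - rad))
    # Tight axis-aligned box: half-widths are sqrt(M00) and sqrt(M11).
    hw, hh = math.sqrt(m00), math.sqrt(m11)
    bbox = (cx - hw, cy - hh, cx + hw, cy + hh)
    return (cx, cy), semi_axes, bbox

if __name__ == "__main__":
    f, u0, v0 = 500.0, 320.0, 240.0
    K = [[f, 0.0, u0], [0.0, f, v0], [0.0, 0.0, 1.0]]
    # Camera at the origin looking along +z: P = K [I | 0].
    P = matmul(K, [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]])
    Q = ellipsoid_dual_quadric([0.0, 0.0, 5.0], [1.0, 1.0, 1.0])
    print(conic_to_ellipse(project_dual_quadric(P, Q)))
```

For a unit sphere 5 units in front of a camera with focal length 500 px and principal point (320, 240), the ellipse is centred at the principal point with both semi-axes equal to 500/√24 ≈ 102.06 px, which matches the closed-form tangent-cone result. An observation model in this family would compare such a projected ellipse (or its box) against a detection; how VOOM does this is described in the paper itself.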
