The ability of robots to autonomously navigate 3D environments depends on their comprehension of spatial concepts, ranging from low-level geometry to high-level semantics such as objects, places, and buildings. To enable such comprehension, 3D scene graphs have emerged as a robust tool for representing the environment as a layered graph of concepts and their relationships. However, building these representations in real time with monocular vision systems remains a difficult task that has not been explored in depth. This paper presents Mono-Hydra, a real-time spatial perception system that combines a monocular camera with an IMU and focuses on indoor scenarios. The proposed approach can nevertheless be adapted to outdoor applications. The system employs a suite of deep learning algorithms to derive depth and semantics, and it uses a robocentric visual-inertial odometry (VIO) algorithm based on square-root information, thereby ensuring consistent odometry estimates from the IMU and the monocular camera. Mono-Hydra achieves sub-20 cm error while processing in real time at 15 fps, enabling real-time 3D scene graph construction on a laptop GPU (NVIDIA 3080). This improves the efficiency and effectiveness of decision-making with simple camera setups and increases the agility of robotic systems. We make Mono-Hydra publicly available at: https://github.com/UAV-Centre-ITC/Mono_Hydra