Current approaches to 3D scene graph generation rely on dedicated depth sensors, such as LiDAR or RGB-D cameras, for metric 3D reconstruction. This limits deployment to specialized robotic platforms and excludes settings where only RGB cameras are available, such as fixed external infrastructure. Existing pipelines also typically operate on passively collected observation trajectories, rather than selecting viewpoints based on the partially built scene representation, and therefore fail to effectively exploit the semantic and spatial information encoded within the graph during exploration. This paper presents a fully visual framework for the active, incremental construction of 3D scene graphs from RGB input only, addressing both limitations. The proposed approach unifies perception and planning around a shared structured representation that captures object semantics, 3D geometry, relational context, and information from multiple viewpoints. Because the framework is hardware-agnostic and relies only on RGB observations, it can incorporate inputs from both onboard robot cameras and fixed external cameras within the same representation. Experiments on the Replica dataset show that the RGB-only pipeline achieves F1-score parity with baselines using ground-truth depth. Active exploration experiments on ReplicaCAD further show that semantic-driven viewpoint selection detects more than twice as many objects as a geometric frontier-based baseline under the same exploration budget. Finally, the external-camera setting demonstrates that complementary RGB views can effectively bootstrap the scene graph and improve contextual understanding at no additional exploration cost.