We present SceneVGGT, a spatio-temporal 3D scene understanding framework that combines SLAM with semantic mapping for autonomous and assistive navigation. Built on VGGT, our method scales to long video streams via a sliding-window pipeline. We align local submaps using camera-pose transformations, enabling memory- and speed-efficient mapping while preserving geometric consistency. Semantics are lifted from 2D instance masks to 3D objects using the VGGT tracking head, which maintains temporally coherent object identities for change detection. As a proof of concept, object locations are projected onto an estimated floor plane for assistive navigation. The pipeline's GPU memory usage remains under 17 GB regardless of input-sequence length, and it achieves competitive point-cloud performance on the ScanNet++ benchmark. Overall, SceneVGGT provides robust semantic identification and is fast enough to support interactive assistive navigation with audio feedback.