Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, a task that demands geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. The defining feature of LingBot-Map is its attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable, efficient inference at around 20 FPS on 518 × 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach outperforms both existing streaming approaches and iterative optimization-based approaches.
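To make the three context types concrete, the following is a minimal sketch of how the past frames visible to the current frame might be partitioned into an anchor context, a pose-reference window, and a trajectory memory. It is an illustration under stated assumptions, not the LingBot-Map implementation: the function names, the `num_anchor` and `window` parameters, the rule that every frame leaving the window moves into memory, and the per-frame token counts are all hypothetical.

```python
def context_for_frame(t, num_anchor=1, window=4):
    """Split the past frames 0..t-1 into three groups (hypothetical layout).

    anchor: the first `num_anchor` frames, assumed to fix the coordinate frame.
    win:    the most recent `window` frames, assumed to keep dense tokens.
    memory: every frame in between, assumed to be kept only as a compact summary.
    """
    past = range(t)  # causal: only frames before t are visible
    anchor = [i for i in past if i < num_anchor]
    win = [i for i in past if i >= max(num_anchor, t - window)]
    memory = [i for i in past if num_anchor <= i < t - window]
    return anchor, win, memory


def token_count(t, dense, compact, num_anchor=1, window=4):
    """Tokens attended to at frame t, given assumed per-frame token budgets.

    `dense` is the assumed token count for an anchor or window frame;
    `compact` is the assumed token count for a memory frame.
    """
    anchor, win, memory = context_for_frame(t, num_anchor, window)
    return (len(anchor) + len(win)) * dense + len(memory) * compact


if __name__ == "__main__":
    print(context_for_frame(10))              # anchor, window, memory indices
    print(token_count(10, dense=100, compact=1))
    print(10 * 100)                           # cost if all past frames stayed dense
```

Under these assumptions, only the anchor and window frames contribute dense tokens, so the attended context grows by `compact` tokens per new frame rather than by `dense` tokens. This is one way a streaming state could stay compact over long sequences.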