This paper addresses large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by regressing 3D geometry directly from RGB images, without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain accuracy and consistency over long sequences because their memory capacity is limited and they cannot effectively capture global contextual cues. Humans, in contrast, naturally use a global understanding of a scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to draw on extensive contextual cues for more accurate and consistent reconstruction. The context representation is realized as a set of lightweight neural sub-networks that are rapidly adapted at test time via self-supervised objectives, which substantially increases memory capacity without significant computational overhead. Experiments on multiple large-scale benchmarks, including the KITTI Odometry~\cite{Geiger2012CVPR} and Oxford Spires~\cite{tao2025spires} datasets, demonstrate the effectiveness of our approach on ultra-large scenes: it achieves leading pose accuracy and state-of-the-art 3D reconstruction accuracy while remaining efficient. Code is available at https://zju3dv.github.io/scal3r.