Photorealistic 3-D reconstruction from monocular video collapses in large-scale scenes when depth, pose, and radiance are solved in isolation: scale-ambiguous depth yields ghost geometry, long-horizon pose drift corrupts alignment, and a single global NeRF cannot model hundreds of metres of content. We introduce a joint learning framework that couples all three factors and demonstrably overcomes each failure case. Our system begins with a Vision Transformer (ViT) depth network trained with metric-scale supervision, giving globally consistent depths despite wide field-of-view variations. A multi-scale feature bundle-adjustment (BA) layer refines camera poses directly in feature space--leveraging learned pyramidal descriptors instead of brittle keypoints--to suppress drift on unconstrained trajectories. For scene representation, we deploy an incremental local-radiance-field hierarchy: new hash-grid NeRFs are allocated and frozen on-the-fly when view overlap falls below a threshold, enabling city-block-scale coverage on a single GPU. Evaluated on the Tanks and Temples benchmark, our method reduces Absolute Trajectory Error to 0.001-0.021 m across eight indoor-outdoor sequences--up to 18x lower than BARF and 2x lower than NoPe-NeRF--while maintaining sub-pixel Relative Pose Error. These results demonstrate that metric-scale, drift-free 3-D reconstruction and high-fidelity novel-view synthesis are achievable from a single uncalibrated RGB camera.
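The incremental allocation policy described above can be sketched as a simple routing rule: each incoming frame is registered to the active local field while its covisibility overlap stays above a threshold; once overlap drops, the active field is frozen and a fresh one is allocated. This is a minimal illustrative sketch, not the paper's implementation; the class name `LocalField`, the covisibility-set overlap estimate, and the threshold value 0.3 are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of incremental local-radiance-field allocation:
# freeze the active hash-grid field and start a new one whenever the
# new frame's view overlap with it falls below a threshold.
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class LocalField:
    """Stand-in for one local hash-grid NeRF (assumed structure)."""
    frame_ids: Set[int] = field(default_factory=set)
    frozen: bool = False

    def register(self, frame_id: int) -> None:
        self.frame_ids.add(frame_id)


def view_overlap(field_frames: Set[int], covisible: Set[int]) -> float:
    """Fraction of the frame's covisible frames already in the field."""
    if not covisible:
        return 0.0
    return len(field_frames & covisible) / len(covisible)


def assign_frame(fields: List[LocalField], frame_id: int,
                 covisible: Set[int], threshold: float = 0.3) -> LocalField:
    """Route a frame to the active field, or freeze it and allocate anew."""
    if fields and not fields[-1].frozen:
        active = fields[-1]
        if view_overlap(active.frame_ids, covisible) >= threshold:
            active.register(frame_id)
            return active
        active.frozen = True       # stop optimizing the exhausted field
    new_field = LocalField()       # allocate a fresh local radiance field
    new_field.register(frame_id)
    fields.append(new_field)
    return new_field
```

Freezing rather than discarding the old field is what keeps memory bounded while preserving already-reconstructed geometry, which is how a single GPU can cover a city-block-scale trajectory one local field at a time.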