In modern dense 3D reconstruction, feed-forward systems (e.g., VGGT, pi3) focus on end-to-end matching and geometry prediction but do not explicitly perform novel view synthesis (NVS). Neural rendering-based approaches offer high-fidelity NVS and detailed geometry from posed images, yet they typically assume fixed camera poses and are sensitive to pose errors. As a result, it remains non-trivial to build a single framework that delivers accurate poses, reliable depth, high-quality rendering, and accurate 3D surfaces from casually captured views. We present NeVStereo, a NeRF-driven NVS-stereo architecture that jointly estimates camera poses, multi-view depth, novel view synthesis, and surface reconstruction from multi-view RGB-only inputs. NeVStereo combines NeRF-based NVS to generate stereo-friendly renderings, confidence-guided multi-view depth estimation, NeRF-coupled bundle adjustment for pose refinement, and an iterative refinement stage that updates both the depth and the radiance field to improve geometric consistency. This design mitigates common NeRF-based issues such as surface stacking, rendering artifacts, and pose-depth coupling. Across indoor, outdoor, tabletop, and aerial benchmarks, our experiments indicate that NeVStereo achieves consistently strong zero-shot performance, with up to 36% lower depth error, 10.4% higher pose accuracy, 4.5% higher NVS fidelity, and state-of-the-art mesh quality (F1 91.93%, Chamfer 4.35 mm) compared with leading existing methods.
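The alternating pipeline described in the abstract (render novel views, estimate confidence-guided depth, refine poses with NeRF-coupled bundle adjustment, then update the radiance field) can be outlined as follows. This is a minimal, hypothetical sketch of the control flow only; every function name and the toy stub bodies are illustrative placeholders, not the paper's actual method or API.

```python
# Hypothetical sketch of NeVStereo-style alternating refinement.
# All names and stub implementations below are illustrative placeholders.

def estimate_initial_poses(images):
    return [0.0] * len(images)  # stub: one coarse pose per image

def fit_nerf(images, poses):
    return {"density": 1.0}     # stub radiance field

def render_novel_views(field, poses):
    return [field["density"]] * len(poses)  # stub stereo-friendly renderings

def multiview_depth(views, poses):
    depth = sum(views) / len(views)         # stub depth estimate
    return depth, 1.0                       # (depth, confidence)

def bundle_adjust(poses, field, images):
    return [p + 0.1 for p in poses]         # stub NeRF-coupled pose update

def update_nerf(field, depth, conf):
    # stub: nudge the field toward confidence-weighted depth evidence
    return {"density": field["density"] * 0.9 + conf * depth * 0.1}

def nev_stereo(images, rounds=3):
    poses = estimate_initial_poses(images)
    field = fit_nerf(images, poses)
    depth = None
    for _ in range(rounds):
        renders = render_novel_views(field, poses)        # NVS for stereo
        depth, conf = multiview_depth(images + renders, poses)
        poses = bundle_adjust(poses, field, images)       # pose refinement
        field = update_nerf(field, depth, conf)           # joint depth/field update
    return poses, depth, field
```

The point of the loop structure is that depth, poses, and the radiance field are each refreshed per round, so errors in one quantity do not stay baked into the others.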