Dense 3D reconstruction and ego-motion estimation are key challenges in autonomous driving and robotics. Compared to the complex, multi-modal systems deployed today, multi-camera systems provide a simpler, low-cost alternative. However, camera-based 3D reconstruction of complex dynamic scenes has proven extremely difficult, as existing solutions often produce incomplete or incoherent results. We propose R3D3, a multi-camera system for dense 3D reconstruction and ego-motion estimation. Our approach iterates between geometric estimation, which exploits spatio-temporal information from multiple cameras, and monocular depth refinement. We integrate multi-camera feature correlation and dense bundle adjustment operators that yield robust geometric depth and pose estimates. To improve reconstruction where geometric depth is unreliable, e.g., for moving objects or low-textured regions, we introduce learnable scene priors via a depth refinement network. We show that this design enables a dense, consistent 3D reconstruction of challenging, dynamic outdoor environments. Consequently, we achieve state-of-the-art dense depth prediction on the DDAD and nuScenes benchmarks.