Existing monocular depth estimation methods are robust across diverse scenes, but they recover only affine-invariant depth, that is, depth up to an unknown scale and shift. In video-based settings such as video depth estimation and 3D scene reconstruction from video, the unknown scale and shift in each per-frame prediction can make depth inconsistent across frames. To solve this problem, we propose a locally weighted linear regression method that recovers the scale and shift from very sparse anchor points and thereby ensures scale consistency across consecutive frames. Extensive experiments show that our method improves the performance of existing state-of-the-art approaches by up to 50% on several zero-shot benchmarks. In addition, we combine over 6.3 million RGBD images to train strong and robust depth models. The resulting ResNet50-backbone model even outperforms the state-of-the-art DPT ViT-Large model. Combined with geometry-based reconstruction methods, our approach yields a new dense 3D scene reconstruction pipeline that benefits from both the scale consistency of sparse points and the robustness of monocular methods. With simple per-frame prediction over a video, an accurate 3D scene shape can be recovered.
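To make the idea of recovering scale and shift from sparse anchors concrete, the following is a minimal NumPy sketch of a locally weighted linear regression, not the authors' implementation. It assumes a Gaussian kernel on pixel distance with a hand-picked `bandwidth`, a closed-form weighted least-squares fit per pixel, and anchors given as integer pixel coordinates with known depth. The function name `recover_scale_shift_lwlr` and all of its parameters are hypothetical.

```python
import numpy as np


def recover_scale_shift_lwlr(pred_depth, anchor_uv, anchor_depth,
                             bandwidth=20.0, eps=1e-12):
    """Fit a per-pixel scale and shift mapping pred_depth onto anchor depths.

    pred_depth   : (H, W) affine-invariant depth prediction.
    anchor_uv    : (N, 2) integer pixel coordinates (u = column, v = row).
    anchor_depth : (N,) reference depth at the anchors.
    bandwidth    : Gaussian kernel width in pixels (assumed, not from the paper).
    """
    h, w = pred_depth.shape
    u, v = anchor_uv[:, 0], anchor_uv[:, 1]
    x = pred_depth[v, u].astype(np.float64)   # predicted depth at anchors
    y = np.asarray(anchor_depth, dtype=np.float64)

    # Pixel grid as (M, 2) query locations, M = H * W.
    vv, uu = np.mgrid[0:h, 0:w]
    queries = np.stack([uu.ravel(), vv.ravel()], axis=1).astype(np.float64)

    # Gaussian weights from squared pixel distance to each anchor, shape (M, N).
    diff = queries[:, None, :] - anchor_uv[None, :, :].astype(np.float64)
    weights = np.exp(-(diff ** 2).sum(axis=-1) / (2.0 * bandwidth ** 2))

    # Weighted sums for the closed-form solution of
    #   min_{s,t} sum_i w_i * (s * x_i + t - y_i)^2   at every pixel.
    sw = weights.sum(axis=1)
    sx = weights @ x
    sy = weights @ y
    sxx = weights @ (x * x)
    sxy = weights @ (x * y)

    det = sw * sxx - sx ** 2                  # non-negative for w_i >= 0
    scale = (sw * sxy - sx * sy) / (det + eps)
    shift = (sy - scale * sx) / sw

    scale_map = scale.reshape(h, w)
    shift_map = shift.reshape(h, w)
    return scale_map * pred_depth + shift_map, scale_map, shift_map


# Usage on synthetic data: the reference depth is a global affine transform
# of the prediction, so every local fit should recover the same scale and shift.
rng = np.random.default_rng(0)
pred = rng.uniform(1.0, 5.0, size=(32, 48))
true_depth = 2.5 * pred + 0.3
anchors = np.stack([rng.integers(0, 48, 20), rng.integers(0, 32, 20)], axis=1)
aligned, scale_map, shift_map = recover_scale_shift_lwlr(
    pred, anchors, true_depth[anchors[:, 1], anchors[:, 0]])
```

In this sketch the bandwidth controls how local the fit is: a large value approaches a single global scale and shift, while a small value lets the correction vary across the image but needs anchors nearby, since weights that underflow to zero leave a pixel without a valid fit. The dense `(H*W, N)` weight matrix is only practical for small images or very few anchors.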