Multi-view depth estimation has achieved impressive performance on various benchmarks. However, almost all current multi-view systems rely on ideal camera poses being given, and such poses are unavailable in many real-world scenarios, such as autonomous driving. In this work, we propose a new robustness benchmark to evaluate depth estimation systems under various noisy pose settings. Surprisingly, we find that current multi-view depth estimation methods, as well as single-view and multi-view fusion methods, fail under noisy pose settings. To address this challenge, we propose a single-view and multi-view fused depth estimation system, which adaptively integrates high-confidence multi-view and single-view results for both robust and accurate depth estimation. The adaptive fusion module performs fusion by dynamically selecting high-confidence regions between the two branches based on a warping confidence map. Thus, the system tends to choose the more reliable branch when facing textureless scenes, inaccurate calibration, dynamic objects, and other degraded or challenging conditions. Our method outperforms state-of-the-art multi-view and fusion methods under robustness testing. Furthermore, we achieve state-of-the-art performance on challenging benchmarks (KITTI and DDAD) when given accurate pose estimates. Project website: https://github.com/Junda24/AFNet/.
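As a rough illustration of the confidence-guided fusion idea described in the abstract, the sketch below blends a single-view and a multi-view depth map pixel by pixel. It is not the fusion module of the proposed system, which the abstract does not specify in detail. The function name `fuse_depth`, the `threshold` parameter, and the convention that the confidence value lies in [0, 1] and expresses trust in the multi-view branch are all assumptions made for this example. Plain nested lists are used to keep the sketch dependency-free.

```python
def fuse_depth(depth_sv, depth_mv, confidence, threshold=None):
    """Fuse two depth maps per pixel (hypothetical helper, not the paper's module).

    depth_sv, depth_mv: 2D lists of depths from the single-view and
        multi-view branches, with identical shapes.
    confidence: 2D list of values in [0, 1]; higher means the multi-view
        estimate is trusted more at that pixel (assumed convention).
    threshold: if None, blend the two depths softly; otherwise select the
        multi-view depth where confidence >= threshold, else single-view.
    """
    fused = []
    for row_sv, row_mv, row_c in zip(depth_sv, depth_mv, confidence):
        fused_row = []
        for d_sv, d_mv, c in zip(row_sv, row_mv, row_c):
            # Clamp so that out-of-range confidences cannot extrapolate.
            c = min(max(c, 0.0), 1.0)
            if threshold is not None:
                # Hard selection between the two branches.
                c = 1.0 if c >= threshold else 0.0
            # Convex combination of the two depth estimates.
            fused_row.append(c * d_mv + (1.0 - c) * d_sv)
        fused.append(fused_row)
    return fused


if __name__ == "__main__":
    depth_sv = [[10.0, 20.0]]    # single-view branch
    depth_mv = [[12.0, 40.0]]    # multi-view branch
    confidence = [[1.0, 0.0]]    # second pixel: multi-view deemed unreliable
    print(fuse_depth(depth_sv, depth_mv, confidence))  # [[12.0, 20.0]]
```

With a low confidence value, for instance at a pixel on a moving object or under a noisy pose, the fused result falls back to the single-view estimate, which is the behaviour the abstract attributes to the adaptive fusion.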