Learning-based multi-view stereo (MVS) methods predict accurate depth maps in order to obtain an accurate and complete 3D representation. Despite their excellent performance, existing methods overlook the fact that a suitable depth geometry is also critical in MVS. In this paper, we demonstrate that different depth geometries lead to significant performance gaps, even at the same depth prediction error. We therefore introduce an ideal depth geometry composed of Saddle-Shaped Cells, in which the predicted depth map oscillates upward and downward around the ground-truth surface rather than maintaining a continuous and smooth depth plane. To achieve this, we develop a coarse-to-fine framework called Dual-MVSNet (DMVSNet), which can produce an oscillating depth plane. Technically, we predict two depth values for each pixel (Dual-Depth) and propose a novel loss function and a checkerboard-shaped selection strategy to constrain the predicted depth geometry. Compared to existing methods, DMVSNet achieves a high rank on the DTU benchmark and obtains the top performance on challenging scenes of Tanks and Temples, demonstrating its strong performance and generalization ability. Our method also points to a new research direction: considering depth geometry in MVS.
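As an illustration of how two per-pixel depth values and a checkerboard-shaped selection could yield a depth map that oscillates around the true surface, consider the minimal sketch below. It is not the DMVSNet implementation: the function name `checkerboard_select`, the rule of keeping the larger candidate on even cells and the smaller on odd cells, and the toy depth values are all assumptions made for this example.

```python
def checkerboard_select(depth_a, depth_b):
    """Fuse two per-pixel depth candidates into one oscillating depth map.

    Assumption (not taken from the paper): on cells where (row + col) is even
    the larger candidate is kept, and on the remaining cells the smaller one.
    Inputs are equally sized 2D lists of floats.
    """
    if len(depth_a) != len(depth_b):
        raise ValueError("depth maps must have the same number of rows")
    fused = []
    for r, (row_a, row_b) in enumerate(zip(depth_a, depth_b)):
        if len(row_a) != len(row_b):
            raise ValueError("depth maps must have the same number of columns")
        fused_row = []
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            # Alternate between the upper and lower candidate like a checkerboard.
            pick = max if (r + c) % 2 == 0 else min
            fused_row.append(pick(a, b))
        fused.append(fused_row)
    return fused


# Toy example: a flat ground-truth surface at depth 5.0, with one candidate
# slightly behind it and one slightly in front of it (hypothetical values).
ground_truth = 5.0
upper = [[5.5] * 4 for _ in range(4)]
lower = [[4.5] * 4 for _ in range(4)]

fused = checkerboard_select(upper, lower)

# Every pixel has the same absolute error, yet neighbouring pixels lie on
# opposite sides of the surface, so the map oscillates around it.
abs_errors = [abs(d - ground_truth) for row in fused for d in row]
```

In this toy setting, every fused pixel has the same absolute error as either candidate map taken alone, but horizontally and vertically adjacent pixels fall on opposite sides of the ground-truth surface. This is the sense in which two depth maps can share a prediction error while differing in depth geometry.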