Computing accurate depth from multiple views is a fundamental and longstanding challenge in computer vision. However, most existing approaches do not generalize well across different domains and scene types (e.g., indoor vs. outdoor). Training a general-purpose multi-view stereo model is challenging and raises several questions: how best to make use of transformer-based architectures; how to incorporate additional metadata when the number of input views varies; and how to estimate the range of valid depths, which can vary considerably across scenes and is typically not known a priori. To address these issues, we introduce MVSA, a novel and versatile Multi-View Stereo architecture that aims to work Anywhere by generalizing across diverse domains and depth ranges. MVSA combines monocular and multi-view cues with an adaptive cost volume to deal with scale-related issues. We demonstrate state-of-the-art zero-shot depth estimation on the Robust Multi-View Depth Benchmark, surpassing existing multi-view stereo and monocular baselines.
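To make the "adaptive cost volume" idea concrete, the following is a minimal sketch (not the paper's actual implementation; the helper names and the quantile heuristic are assumptions) of two common ingredients in MVSNet-style pipelines: deriving a per-scene depth range from a coarse monocular cue, and computing a variance-based matching cost over a variable number of views.

```python
import numpy as np

def adaptive_depth_hypotheses(mono_depth, num_planes=8, lo_q=0.02, hi_q=0.98):
    """Hypothetical helper: derive a per-scene depth range from a coarse
    monocular depth map via robust quantiles, then sample hypotheses
    uniformly in inverse depth, which covers both near (indoor) and far
    (outdoor) ranges without a fixed a-priori depth interval."""
    d_min = np.quantile(mono_depth, lo_q)
    d_max = np.quantile(mono_depth, hi_q)
    inv = np.linspace(1.0 / d_max, 1.0 / d_min, num_planes)
    return (1.0 / inv)[::-1]  # ascending depth hypotheses in [d_min, d_max]

def variance_cost_volume(ref_feat, src_feats):
    """Variance-based matching cost over the reference feature map and a
    variable number of (already warped) source-view feature maps, each of
    shape (C, H, W). Lower variance = better photometric consistency;
    this naturally handles any number of input views."""
    feats = np.stack([ref_feat] + list(src_feats))  # (V+1, C, H, W)
    return feats.var(axis=0).mean(axis=0)           # (H, W) cost per pixel
```

In a full pipeline these pieces would sit inside a plane-sweep loop: features are warped onto each depth hypothesis before the variance cost is accumulated, and the hypotheses themselves come from the monocular branch rather than a dataset-specific fixed range.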