Recent learning-based Multi-View Stereo (MVS) methods prominently feature transformer-based models with attention mechanisms. However, existing approaches have not thoroughly investigated how transformers affect the different MVS modules, which limits their depth estimation capability. In this paper, we introduce MVSFormer++, a method that carefully exploits the inherent characteristics of attention to enhance various components of the MVS pipeline. Specifically, we infuse cross-view information into the pre-trained DINOv2 model to facilitate MVS learning. Furthermore, we employ different attention mechanisms for the feature encoder and for cost volume regularization, focusing on feature and spatial aggregation, respectively. We also find that several design details substantially affect the performance of transformer modules in MVS, including normalized 3D positional encoding, adaptive attention scaling, and the position of layer normalization. Comprehensive experiments on DTU, Tanks-and-Temples, BlendedMVS, and ETH3D validate the effectiveness of the proposed method. Notably, MVSFormer++ achieves state-of-the-art performance on the challenging DTU and Tanks-and-Temples benchmarks.
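To make the idea of adaptive attention scaling more concrete, the following is a minimal NumPy sketch of one common way to let the softmax scale depend on sequence length. The abstract does not state the formula used in MVSFormer++, so the `log(m) / log(train_len)` factor, the function name `adaptive_scaled_attention`, and the `train_len` parameter are all assumptions made for illustration only.

```python
import numpy as np


def adaptive_scaled_attention(q, k, v, train_len):
    """Single-head attention with a length-dependent scale (illustrative only).

    q: (n, d) queries, k: (m, d) keys, v: (m, dv) values.
    train_len: hypothetical number of keys seen during training (must be > 1).
    """
    m, d = k.shape
    # Assumption: multiply the usual 1/sqrt(d) scale by log(m) / log(train_len),
    # so that attention sharpens when there are more keys than during training.
    scale = (np.log(m) / np.log(train_len)) / np.sqrt(d)
    logits = (q @ k.T) * scale
    # Subtract the row-wise maximum for a numerically stable softmax.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(16, 8))
v = rng.normal(size=(16, 3))
out, w = adaptive_scaled_attention(q, k, v, train_len=16)
```

When the number of keys equals `train_len`, the extra factor is 1 and the sketch reduces to standard scaled dot-product attention. With more keys the logits are scaled up, which counteracts the tendency of softmax weights to flatten over longer sequences, a situation that arises in MVS when test images are higher resolution than training images.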