The core of Multi-view Stereo (MVS) is the matching process between reference and source pixels. Cost aggregation plays a significant role in this process, and previous methods mostly handle it with CNNs. Such methods may inherit a natural limitation of CNNs: because their receptive fields are local and limited, they can fail to discriminate repetitive or incorrect matches. To address this issue, we introduce Transformers into cost aggregation. However, the computational complexity of a Transformer grows quadratically with the number of tokens, which can cause memory overflow and high inference latency. In this paper, we overcome these limits with an efficient Transformer-based cost aggregation network, namely CostFormer. The Residual Depth-Aware Cost Transformer (RDACT) is proposed to aggregate long-range features on the cost volume via self-attention mechanisms along the depth and spatial dimensions. Furthermore, the Residual Regression Transformer (RRT) is proposed to enhance spatial attention. The proposed method is a universal plug-in that improves learning-based MVS methods.
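To make the complexity argument concrete, the following is a minimal NumPy sketch of residual self-attention restricted to the depth axis of a cost volume. It is not the RDACT module itself: the `(C, D, H, W)` layout, the function names `depth_self_attention` and `attention_entries`, and the use of identity query/key/value projections (no learned weights, single head) are all assumptions made for illustration. It only shows why attending along one axis at a time keeps the attention matrices small compared with attending over every voxel jointly.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def depth_self_attention(cost):
    """Residual self-attention over the depth axis of a cost volume.

    cost: array of shape (C, D, H, W), an assumed layout with C feature
    channels and D depth hypotheses per pixel.
    Simplification: queries, keys and values are the tokens themselves
    (identity projections), so there are no learned parameters.
    """
    C, D, H, W = cost.shape
    # One sequence of D tokens (each of dimension C) per pixel.
    tokens = cost.transpose(2, 3, 1, 0).reshape(H * W, D, C)
    # Scaled dot-product scores: one D x D matrix per pixel.
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(C)
    weights = softmax(scores, axis=-1)
    attended = weights @ tokens  # (H*W, D, C)
    # Restore the (C, D, H, W) layout and add the residual connection.
    out = attended.reshape(H, W, D, C).transpose(3, 2, 0, 1)
    return cost + out


def attention_entries(D, H, W):
    """Number of attention-matrix entries for two attention schemes.

    full:       every voxel attends to every voxel, (D*H*W)^2 entries.
    depth_only: each pixel attends over its own D hypotheses, H*W*D^2 entries.
    """
    full = (D * H * W) ** 2
    depth_only = H * W * D * D
    return full, depth_only


if __name__ == "__main__":
    volume = np.random.rand(8, 4, 8, 8).astype(np.float32)
    print(depth_self_attention(volume).shape)  # (8, 4, 8, 8)
    print(attention_entries(4, 8, 8))          # (65536, 1024)
```

Even for this tiny 4 x 8 x 8 volume, joint attention over all voxels needs 64 times more attention entries than depth-only attention, and the gap widens as the resolution grows. A practical module would add learned projections, multiple heads, and some form of spatial attention as well.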