This paper introduces a Transformer-based integrative feature and cost aggregation network for dense matching. Many dense matching methods benefit from one of two forms of aggregation: feature aggregation, which aligns similar features, or cost aggregation, which instills coherence in the flow estimates across neighboring pixels. In this work, we first show that feature aggregation and cost aggregation exhibit distinct characteristics, and that using both judiciously can yield substantial benefits. We then introduce a simple yet effective architecture that uses self- and cross-attention to unify feature aggregation and cost aggregation and harness the strengths of both. Within the proposed attention layers, the features and the cost volume complement each other, and the layers are interleaved in a coarse-to-fine design to further promote accurate correspondence estimation. Finally, at inference, our network produces multi-scale predictions, computes their confidence scores, and selects the most confident flow as the final prediction. We evaluate our framework on standard benchmarks for semantic matching and also apply it to geometric matching, where it achieves significant improvements over existing methods.
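The inference step described above (multi-scale predictions, confidence scoring, selection of the most confident flow) can be illustrated with a minimal NumPy sketch. This is not the network from the paper: the abstract does not specify how confidence is computed, so the score used here (the mean peak of a softmax over each cost-volume row) is an assumption, as are the `temperature` value and all function names (`cost_volume`, `confidence`, `flow_from_cost`, `select_most_confident_flow`), which are hypothetical. The learned attention-based aggregation is omitted entirely.

```python
import numpy as np


def cost_volume(feat_src, feat_tgt):
    """Cosine-similarity cost volume between (N_src, C) and (N_tgt, C) features."""
    src = feat_src / (np.linalg.norm(feat_src, axis=1, keepdims=True) + 1e-8)
    tgt = feat_tgt / (np.linalg.norm(feat_tgt, axis=1, keepdims=True) + 1e-8)
    return src @ tgt.T  # shape (N_src, N_tgt)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def confidence(cost, temperature=0.05):
    """Assumed confidence score: mean peak matching probability over source positions."""
    prob = softmax(cost / temperature, axis=1)
    return float(prob.max(axis=1).mean())


def flow_from_cost(cost, h, w):
    """Convert the best match of each source position on an h x w grid into a flow field."""
    idx = cost.argmax(axis=1)                # best target index per source position
    ty, tx = np.divmod(idx, w)               # target (row, col)
    sy, sx = np.divmod(np.arange(h * w), w)  # source (row, col)
    return np.stack([tx - sx, ty - sy], axis=-1).reshape(h, w, 2)


def select_most_confident_flow(scales):
    """scales: list of (cost, h, w) tuples, one per scale (hypothetical interface)."""
    scores = [confidence(cost) for cost, _, _ in scales]
    best = int(np.argmax(scores))
    cost, h, w = scales[best]
    return best, flow_from_cost(cost, h, w), scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two toy scales: a 2x2 grid with identical source/target features,
    # and a 4x4 grid with unrelated random features.
    f_coarse = rng.normal(size=(4, 8))
    scales = [
        (cost_volume(rng.normal(size=(16, 8)), rng.normal(size=(16, 8))), 4, 4),
        (cost_volume(f_coarse, f_coarse), 2, 2),
    ]
    best, flow, scores = select_most_confident_flow(scales)
    print("selected scale:", best, "flow shape:", flow.shape)
```

In this sketch a single scale is selected for the whole image; a per-pixel selection across upsampled flow fields would be an equally plausible reading of the abstract, which does not say which is used.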