Existing stereo matching networks typically rely on either cost-volume construction with 3D convolutions or refinement methods based on iterative optimization. The former incurs significant computational overhead during cost aggregation, whereas the latter often lacks the ability to model non-local contextual information. Both families are poorly suited to resource-constrained mobile devices, limiting their deployment in real-time applications. To address this, we propose a Multi-frequency Adaptive Fusion Network (MAFNet), which produces high-quality disparity maps using only efficient 2D convolutions. Specifically, we design an adaptive frequency-domain filtering attention module that decomposes the full cost volume into high-frequency and low-frequency volumes and performs frequency-aware feature aggregation on each separately. We then introduce a Linformer-based low-rank attention mechanism to adaptively fuse the high- and low-frequency information, yielding more robust disparity estimates. Extensive experiments demonstrate that MAFNet significantly outperforms existing real-time methods on public datasets such as Scene Flow and KITTI 2015, achieving a favorable balance between accuracy and real-time performance.
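The two core ideas can be illustrated with a minimal NumPy sketch. This is a hypothetical stand-in, not the paper's implementation: the frequency split uses an ideal low-pass mask in the 2D Fourier domain (rather than a learned adaptive filter), and the Linformer-style attention uses fixed random projections (rather than learned ones). The function names, the `cutoff` and `rank` parameters, and all shapes are illustrative assumptions.

```python
import numpy as np

def frequency_split(cost_volume, cutoff=0.25):
    """Split a cost volume of shape (D, H, W) into low- and high-frequency
    parts using an ideal low-pass mask in the 2D Fourier domain.
    (Hypothetical stand-in for the paper's learned frequency-domain filter.)"""
    fy = np.fft.fftfreq(cost_volume.shape[-2])[:, None]  # vertical frequencies
    fx = np.fft.fftfreq(cost_volume.shape[-1])[None, :]  # horizontal frequencies
    mask = (np.sqrt(fy**2 + fx**2) <= cutoff).astype(cost_volume.dtype)
    spec = np.fft.fft2(cost_volume, axes=(-2, -1))
    low = np.real(np.fft.ifft2(spec * mask, axes=(-2, -1)))
    high = cost_volume - low          # residual carries the high frequencies
    return low, high

def linformer_attention(q, k, v, rank=8, rng=None):
    """Linformer-style low-rank attention: project the length-N key/value
    sequences down to `rank` tokens before softmax attention, reducing the
    cost from O(N^2) to O(N * rank). Random projections here stand in for
    the learned projection matrices of an actual Linformer layer."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = k.shape
    E = rng.standard_normal((rank, n)) / np.sqrt(n)  # key projection
    F = rng.standard_normal((rank, n)) / np.sqrt(n)  # value projection
    k_r, v_r = E @ k, F @ v                          # (rank, d) each
    scores = q @ k_r.T / np.sqrt(d)                  # (n, rank)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                # row-wise softmax
    return w @ v_r                                   # (n, d) fused features
```

In this sketch, the fusion step would flatten per-pixel features from the separately aggregated low- and high-frequency volumes into token sequences and attend across them with `linformer_attention`; because the split is defined by `high = cost_volume - low`, the two branches always sum back to the original volume.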