We present a unified formulation and model for three motion and 3D perception tasks: optical flow, rectified stereo matching and unrectified stereo depth estimation from posed images. Unlike previous specialized architectures for each specific task, we formulate all three tasks as a unified dense correspondence matching problem, which can be solved with a single model by directly comparing feature similarities. Such a formulation calls for discriminative feature representations, which we achieve using a Transformer, in particular the cross-attention mechanism. We demonstrate that cross-attention enables integration of knowledge from another image via cross-view interactions, which greatly improves the quality of the extracted features. Our unified model naturally enables cross-task transfer since the model architecture and parameters are shared across tasks. We outperform RAFT with our unified model on the challenging Sintel dataset, and our final model that uses a few additional task-specific refinement steps outperforms or compares favorably to recent state-of-the-art methods on 10 popular flow, stereo and depth datasets, while being simpler and more efficient in terms of model design and inference speed.
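To make the unified formulation concrete, the following is a minimal sketch of dense correspondence matching by direct feature comparison. It is an illustration under stated assumptions, not the model itself: the abstract does not specify the matching layer, so the scaled dot-product similarity and the softmax-weighted average of candidate coordinates are assumptions, and the names `soft_match` and `displacement` are hypothetical. The point it shows is that the same matching routine serves different tasks once the candidate set changes: all pixels of the second image for optical flow (2D coordinates), or pixels on the same scanline for rectified stereo (1D coordinates).

```python
import math


def soft_match(query_feats, cand_feats, cand_coords):
    """Return, for each query feature, the softmax-weighted mean of candidate coordinates.

    Similarity is a dot product scaled by sqrt(feature dim). This weighting
    scheme is an assumption of the sketch, not taken from the abstract.
    """
    dim = len(cand_coords[0])
    matches = []
    for q in query_feats:
        scale = math.sqrt(len(q))
        sims = [sum(a * b for a, b in zip(q, c)) / scale for c in cand_feats]
        peak = max(sims)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in sims]
        total = sum(weights)
        matches.append(tuple(
            sum(w * xy[k] for w, xy in zip(weights, cand_coords)) / total
            for k in range(dim)
        ))
    return matches


def displacement(query_coords, matches):
    """Matched coordinate minus query coordinate (in 2D this is optical flow)."""
    return [tuple(m - q for q, m in zip(qc, mc))
            for qc, mc in zip(query_coords, matches)]


# Toy features (hypothetical): four pixels with mutually orthogonal descriptors.
feats = [[20.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Optical flow: candidates are all pixels of a 2x2 second image.
grid = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
flow = displacement([(0.0, 0.0)], soft_match([feats[3]], feats, grid))

# Rectified stereo: candidates lie on the same scanline, so coordinates are 1D.
scanline = [(0.0,), (1.0,), (2.0,), (3.0,)]
match_x = soft_match([feats[1]], feats, scanline)[0][0]
disparity = 3.0 - match_x  # left pixel at x = 3 matched against the right scanline
print(flow, disparity)
```

Because the routine only compares features, its output is only as good as the features are discriminative, which is the role the abstract assigns to the Transformer with cross-attention.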