Optical flow estimation aims to find the dense 2D motion field between two frames. Owing to limitations in model structure and training data, existing methods often rely too heavily on local cues and ignore the integrity of objects, which results in fragmented motion estimates. We observe that the recent Segment Anything Model (SAM) demonstrates a strong ability to segment complete objects, which makes it well suited to addressing the fragmentation problem in optical flow estimation. We therefore propose embedding the frozen SAM image encoder into FlowFormer to enhance object perception. To address the challenge of making deeper use of SAM in non-segmentation tasks such as optical flow estimation, we propose an Optical Flow Task-Specific Adaption scheme, which comprises a Context Fusion Module that fuses the SAM encoder with the optical flow context encoder, and a Context Adaption Module that adapts the SAM features to the optical flow task through a Learned Task-Specific Embedding. Our proposed SAMFlow model reaches 0.86/2.10 clean/final EPE on the Sintel training set and 3.55/12.32 EPE/F1-all on the KITTI-15 training set, surpassing FlowFormer by 8.5%/9.9% and 13.2%/16.3%, respectively. Furthermore, our model achieves state-of-the-art performance on the Sintel and KITTI-15 benchmarks, ranking #1 among all two-frame methods on the Sintel clean pass.
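For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how they are commonly computed, not the evaluation code used for SAMFlow. EPE (end-point error) is the mean Euclidean distance between predicted and ground-truth flow vectors, and F1-all, as defined by the KITTI-15 benchmark, is the percentage of pixels whose error exceeds both 3 px and 5% of the ground-truth flow magnitude. The function names, the `(H, W, 2)` array layout, and the optional `valid` mask are assumptions made for illustration.

```python
import numpy as np


def epe_map(flow_pred, flow_gt):
    """Per-pixel end-point error for flows of shape (H, W, 2); returns (H, W)."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1)


def average_epe(flow_pred, flow_gt, valid=None):
    """Mean end-point error over valid pixels (all pixels if no mask is given)."""
    err = epe_map(flow_pred, flow_gt)
    if valid is None:
        valid = np.ones(err.shape, dtype=bool)
    return float(err[valid].mean())


def f1_all(flow_pred, flow_gt, valid=None):
    """Percentage of outliers: error > 3 px AND error > 5% of the GT magnitude."""
    err = epe_map(flow_pred, flow_gt)
    gt_mag = np.linalg.norm(flow_gt, axis=-1)
    if valid is None:
        valid = np.ones(err.shape, dtype=bool)
    outlier = (err > 3.0) & (err > 0.05 * gt_mag)
    return float(100.0 * outlier[valid].mean())


# Toy example (hypothetical data): a 2x2 flow field with one wrong pixel.
flow_gt = np.zeros((2, 2, 2))
flow_gt[..., 0] = 10.0              # every pixel moves 10 px to the right
flow_pred = flow_gt.copy()
flow_pred[0, 0] += (3.0, 4.0)       # one pixel is off by a (3, 4) vector, i.e. 5 px

toy_epe = average_epe(flow_pred, flow_gt)
toy_f1 = f1_all(flow_pred, flow_gt)
```

In this toy case one pixel out of four has a 5 px error, so the average EPE is 1.25 px and F1-all is 25%. Lower is better for both metrics, which is why the reductions relative to FlowFormer are reported as improvements.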