We present FlowIt, a novel architecture for optical flow estimation designed to robustly handle large pixel displacements. At its core, FlowIt uses a hierarchical transformer that captures extensive global context, allowing it to model long-range correspondences. To overcome the limitations of localized matching, we formulate flow initialization as an optimal transport problem. This formulation yields a robust initial flow field together with explicitly derived occlusion and confidence maps. These cues guide a subsequent refinement stage, in which the network propagates reliable motion estimates from high-confidence regions into ambiguous, low-confidence areas. Extensive experiments on the Sintel, KITTI, Spring, and LayeredFlow datasets validate the approach. FlowIt achieves state-of-the-art results on the competitive Sintel and KITTI benchmarks and sets a new state of the art in zero-shot cross-dataset generalization on Sintel, Spring, and LayeredFlow.
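To make the optimal transport formulation concrete, the following is a minimal sketch of how a flow field and a confidence map could be derived from a transport plan between two feature maps. It is not FlowIt's implementation: the cosine cost, the uniform marginals, the plain Sinkhorn iteration, and the names `sinkhorn`, `init_flow`, `eps`, and `n_iters` are all assumptions chosen for illustration. Occlusion estimation is omitted, since it would need some form of unmatched mass that this balanced setup does not model.

```python
import numpy as np

def sinkhorn(cost, eps=0.05, n_iters=200):
    """Entropy-regularized optimal transport with uniform marginals.

    Illustrative assumption, not the paper's solver.
    cost: (N, M) matching cost. Returns a transport plan of shape (N, M).
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # source marginal (uniform)
    b = np.full(m, 1.0 / m)          # target marginal (uniform)
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows to match a
        v = b / (K.T @ u)            # rescale columns to match b
    return u[:, None] * K * v[None, :]

def init_flow(feat1, feat2, eps=0.05, n_iters=200):
    """Hypothetical flow initialization from two (H, W, C) feature maps.

    Returns flow of shape (H, W, 2) in (x, y) order and a confidence
    map of shape (H, W).
    """
    h, w, c = feat1.shape
    f1 = feat1.reshape(-1, c)
    f2 = feat2.reshape(-1, c)
    f1 = f1 / np.linalg.norm(f1, axis=1, keepdims=True)
    f2 = f2 / np.linalg.norm(f2, axis=1, keepdims=True)
    cost = 1.0 - f1 @ f2.T           # cosine distance, all pixel pairs
    plan = sinkhorn(cost, eps, n_iters)
    # Row-normalize the plan into per-pixel matching distributions.
    match = plan / plan.sum(axis=1, keepdims=True)
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    target = match @ coords          # expected matched position
    flow = (target - coords).reshape(h, w, 2)
    # Peak matching probability as a simple confidence proxy.
    confidence = match.max(axis=1).reshape(h, w)
    return flow, confidence
```

Because the cost covers all pixel pairs, the matching is global rather than restricted to a local search window, which is the property the abstract relies on for large displacements. The dense (N, M) cost matrix makes this sketch practical only at coarse resolutions.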