Despite impressive performance on high-level downstream tasks, self-supervised pre-training methods have not yet fully delivered on dense geometric vision tasks such as stereo matching or optical flow. The application of self-supervised concepts, such as instance discrimination or masked image modeling, to geometric tasks is an active area of research. In this work, we build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view of the same scene, which makes it well suited for binocular downstream tasks. The applicability of this concept has so far been limited in at least two ways: (a) by the difficulty of collecting real-world image pairs (in practice, only synthetic data have been used) and (b) by the lack of generalization of vanilla transformers to dense downstream tasks, for which relative position is more meaningful than absolute position. We explore three avenues of improvement. First, we introduce a method to collect suitable real-world image pairs at large scale. Second, we experiment with relative positional embeddings and show that they enable vision transformers to perform substantially better. Third, we scale up vision-transformer-based cross-view completion architectures, which is made possible by the use of large amounts of data. With these improvements, we show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques such as correlation volumes, iterative estimation, image warping, or multi-scale reasoning, thus paving the way towards universal vision models.
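To illustrate why relative position can matter more than absolute position for dense matching, the following is a minimal sketch of one relative positional scheme, a rotary-style embedding. The abstract does not state which relative embedding is used, so this choice, the one-dimensional positions, and the names `rotate` and `score` are assumptions made purely for illustration. The sketch shows that the attention logit between a query and a key depends only on the offset between their positions, not on where the pair sits in the image.

```python
import math

def rotate(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by angles proportional to `pos`.

    Hypothetical helper: a 1D rotary-style embedding, used here only as an
    example of a relative positional scheme.
    """
    d = len(vec)
    assert d % 2 == 0, "feature dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # one frequency per feature pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def score(q, k, q_pos, k_pos):
    """Attention logit between a query at q_pos and a key at k_pos."""
    rq, rk = rotate(q, q_pos), rotate(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))

# Arbitrary illustrative feature vectors (not taken from any model).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same offset (k_pos - q_pos = 3) at two different absolute locations.
near = score(q, k, 2, 5)
far = score(q, k, 40, 43)

# A different offset (7) gives a different logit.
different = score(q, k, 2, 9)
```

Here `near` and `far` agree up to floating-point error, while `different` does not. This shift invariance is the property a matching task such as stereo or optical flow would plausibly benefit from, since a displacement between two views has the same meaning wherever in the image it occurs.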