Optical flow and stereo disparity are both image-matching problems and can therefore benefit from joint training. Depth and 3D motion provide geometric rather than photometric information and can further improve optical flow. Accordingly, we design a first network that estimates flow and disparity jointly and is trained without supervision. A second network, trained with the optical flow from the first network as pseudo-labels, takes the disparities from the first network, estimates a 3D rigid motion at every pixel, and reconstructs the optical flow from them. A final stage fuses the outputs of the two networks. In contrast to previous methods that consider only camera motion, our method also estimates the rigid motions of dynamic objects, which are of key interest in applications. This yields better optical flow, with visibly more detailed occlusions and object boundaries. Our unsupervised pipeline achieves an optical flow error of 7.36% on the KITTI-2015 benchmark, outperforming the previous state of the art (9.38%) by a wide margin. It also achieves slightly better or comparable stereo depth results. Code will be made available.
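The geometric step performed by the second network, turning disparity and a per-pixel rigid motion back into optical flow, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a rectified stereo pair with known focal length and baseline, a pinhole intrinsics matrix `K`, and a rigid motion `(R, t)` per pixel that maps 3D points from the first frame's camera coordinates to the second frame's. The function names `disparity_to_depth` and `rigid_flow` are hypothetical.

```python
import numpy as np


def disparity_to_depth(disparity, focal, baseline, eps=1e-6):
    """Depth from rectified stereo disparity: Z = f * B / d (assumed setup)."""
    return focal * baseline / np.maximum(disparity, eps)


def rigid_flow(depth, K, R, t):
    """Optical flow induced by a per-pixel rigid motion.

    depth: (H, W) depth in the first frame
    K:     (3, 3) pinhole intrinsics (assumed known)
    R:     (H, W, 3, 3) per-pixel rotation
    t:     (H, W, 3) per-pixel translation
    Returns flow of shape (H, W, 2) in pixels.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # homogeneous pixels

    # Back-project every pixel to a 3D point: X = Z * K^{-1} [u, v, 1]^T
    rays = pix @ np.linalg.inv(K).T
    points = rays * depth[..., None]

    # Apply the rigid motion estimated at each pixel: X' = R X + t
    moved = np.einsum("hwij,hwj->hwi", R, points) + t

    # Project back into the image and subtract the original pixel position
    proj = moved @ K.T
    uv2 = proj[..., :2] / proj[..., 2:3]
    return uv2 - pix[..., :2]
```

Because each pixel carries its own `(R, t)`, pixels on an independently moving object can take a motion different from that of the static background, which is the distinction the abstract draws against methods that model camera motion only.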