We present TemporalStereo, a coarse-to-fine stereo matching network that is highly efficient and able to effectively exploit past geometry and context information to boost matching accuracy. Our network leverages a sparse cost volume and proves effective when a single stereo pair is given. Moreover, its distinctive ability to use spatio-temporal information across stereo sequences allows TemporalStereo to alleviate problems such as occlusions and reflective regions while remaining highly efficient in this setting as well. Notably, our model, trained once with stereo videos, can run in both single-pair and temporal modes seamlessly. Experiments show that our network, which relies on camera motion, is robust even to dynamic objects when running on videos. We validate TemporalStereo through extensive experiments on synthetic (SceneFlow, TartanAir) and real (KITTI 2012, KITTI 2015) datasets. Our model achieves state-of-the-art performance on all of these datasets. Code is available at \url{https://github.com/youmi-zym/TemporalStereo.git}.