Stereo matching provides depth estimation from binocular images for downstream applications. These applications mostly take video streams as input and require temporally consistent depth maps. However, existing methods mainly focus on estimation at the single-frame level, which commonly leads to temporally inconsistent results, especially in ill-posed regions. In this paper, we aim to leverage temporal information to improve the temporal consistency, accuracy, and efficiency of stereo matching. To achieve this, we formulate video stereo matching as a process of temporal disparity completion followed by continuous iterative refinement. Specifically, we first project the disparity of the previous timestamp into the current viewpoint, obtaining a semi-dense disparity map. We then complete this map with a disparity completion module to obtain a well-initialized disparity map. The state features from the current completion module and from past refinements are fused together, providing a temporally coherent state for subsequent refinement. Based on this coherent state, we introduce a dual-space refinement module that iteratively refines the initialized result in both the disparity and disparity-gradient spaces, improving estimates in ill-posed regions. Extensive experiments demonstrate that our method effectively alleviates temporal inconsistency while enhancing both accuracy and efficiency.
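The first step of the pipeline, projecting the previous frame's disparity into the current viewpoint to obtain a semi-dense map, can be illustrated with a minimal forward-warping sketch. This is an illustrative assumption, not the paper's implementation: it presumes the per-pixel motion from the previous to the current frame is given as a flow field (e.g. derived from camera pose and depth), and it uses nearest-neighbor splatting with no occlusion handling. The function name `project_prev_disparity` is hypothetical.

```python
import numpy as np

def project_prev_disparity(prev_disp, flow):
    """Forward-warp the previous frame's disparity into the current view.

    prev_disp : (H, W) disparity map of the previous frame.
    flow      : (H, W, 2) per-pixel motion (dx, dy) from the previous to
                the current frame -- assumed given; not specified in the
                abstract itself.

    Returns a semi-dense (H, W) map: pixels receiving no source value
    remain NaN, which is what makes a completion module necessary.
    """
    H, W = prev_disp.shape
    cur = np.full((H, W), np.nan, dtype=prev_disp.dtype)
    ys, xs = np.mgrid[0:H, 0:W]
    # Nearest-neighbor target coordinates in the current frame.
    xt = np.round(xs + flow[..., 0]).astype(int)
    yt = np.round(ys + flow[..., 1]).astype(int)
    valid = (xt >= 0) & (xt < W) & (yt >= 0) & (yt < H)
    # Splat source disparities; z-buffering for occlusions is omitted.
    cur[yt[valid], xt[valid]] = prev_disp[ys[valid], xs[valid]]
    return cur
```

Pixels that fall outside the image or that no source pixel maps onto stay NaN, so the output is exactly the kind of semi-dense map that the completion module is then expected to fill in.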