Video depth estimation is crucial in applications such as scene reconstruction and augmented reality. In contrast to the naive approach of estimating depth from individual images, a more sophisticated approach uses temporal information, thereby eliminating flickering and geometric inconsistencies. We propose a consistent method for dense video depth estimation that, unlike existing monocular methods, operates on stereo videos and so overcomes the limitations of monocular input. The stereo input also lets us introduce a left-right consistency loss that improves performance. In addition, we use SLAM-based camera pose estimation. To address depth blurriness during test-time training (TTT), we present an edge-preserving loss function that improves the visibility of fine details while preserving geometric consistency. We show that our edge-aware stereo video model can accurately estimate dense depth maps.
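The abstract does not give the form of the edge-preserving loss. As a rough illustration of the general idea only, the sketch below implements a commonly used edge-aware smoothness term, which penalizes depth gradients less where the image itself has strong gradients. The function name `edge_aware_smoothness`, the `alpha` weight, and the grayscale-image input are assumptions made for this example and are not taken from the paper; the loss actually proposed may differ.

```python
import numpy as np


def edge_aware_smoothness(depth, image, alpha=10.0):
    """Edge-aware smoothness penalty (illustrative, not the paper's loss).

    depth: (H, W) array of predicted depth or disparity.
    image: (H, W) grayscale image, assumed to lie in [0, 1].
    alpha: hypothetical sharpness of the edge weighting.
    """
    depth = np.asarray(depth, dtype=np.float64)
    image = np.asarray(image, dtype=np.float64)
    if depth.shape != image.shape or depth.ndim != 2:
        raise ValueError("depth and image must be 2-D arrays of equal shape")

    # Absolute forward differences of the depth map along x and y.
    d_dx = np.abs(depth[:, 1:] - depth[:, :-1])
    d_dy = np.abs(depth[1:, :] - depth[:-1, :])

    # Absolute forward differences of the image along x and y.
    i_dx = np.abs(image[:, 1:] - image[:, :-1])
    i_dy = np.abs(image[1:, :] - image[:-1, :])

    # Weights fall toward 0 at image edges, so depth discontinuities
    # that coincide with image edges are penalized only lightly.
    w_x = np.exp(-alpha * i_dx)
    w_y = np.exp(-alpha * i_dy)

    return float((d_dx * w_x).mean() + (d_dy * w_y).mean())
```

With this form, a depth step that coincides with an image edge costs far less than the same step over a textureless region, which is what lets sharp depth boundaries survive while flat regions are still smoothed.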