Monocular depth estimation has recently shown encouraging prospects, especially for autonomous driving. To tackle the ill-posed problem of 3D geometric reasoning from 2D monocular images, multi-frame monocular methods have been developed to leverage perspective correlation across sequential temporal frames. However, moving objects such as cars and trains usually violate the static scene assumption, leading to feature inconsistency and misaligned cost values that mislead the optimization algorithm. In this work, we present CTA-Depth, a Context-aware Temporal Attention guided network for multi-frame monocular Depth estimation. Specifically, we first apply a multi-level attention enhancement module that integrates multi-level image features to obtain initial depth and pose estimates. The proposed CTA-Refiner then alternately optimizes the depth and pose. During refinement, context-aware temporal attention (CTA) captures global temporal-context correlations to maintain the feature consistency and estimation integrity of moving objects. In particular, we propose a long-range geometry embedding (LGE) module to produce a long-range temporal geometry prior. Our approach achieves significant improvements over state-of-the-art approaches on three benchmark datasets.
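To make the static scene assumption concrete, the following minimal sketch warps one pixel from a target frame into a neighboring frame using its depth and the relative camera pose, which is the geometric step that multi-frame cost computation generally relies on. It is not the CTA-Depth implementation: the pinhole intrinsics, the pose, the pixel, the depth, and the object motion are all illustrative assumptions, and the helper names are hypothetical.

```python
import math

def backproject(intrinsics, pixel, depth):
    """Lift a pixel with known depth to a 3D point in camera coordinates."""
    fx, fy, cx, cy = intrinsics
    u, v = pixel
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)

def transform(R, t, X):
    """Apply a rigid transform X' = R X + t (R is a 3x3 nested list)."""
    return tuple(sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3))

def project(intrinsics, X):
    """Project a 3D point in camera coordinates to pixel coordinates."""
    fx, fy, cx, cy = intrinsics
    x, y, z = X
    return (fx * x / z + cx, fy * y / z + cy)

def misalignment(intrinsics, R, t, pixel, depth, object_motion):
    """Pixel distance between the static-scene warp and the true location."""
    X = backproject(intrinsics, pixel, depth)
    # Where the warp predicts the point, assuming the scene did not move.
    predicted = project(intrinsics, transform(R, t, X))
    # Where the point really appears after the object itself has moved.
    moved = tuple(a + b for a, b in zip(X, object_motion))
    observed = project(intrinsics, transform(R, t, moved))
    return math.hypot(predicted[0] - observed[0], predicted[1] - observed[1])

# Assumed values, chosen only for illustration.
intrinsics = (720.0, 720.0, 620.0, 187.0)      # fx, fy, cx, cy
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = (0.0, 0.0, -1.0)                           # camera advances 1 m
pixel, depth = (700.0, 200.0), 20.0

static_err = misalignment(intrinsics, R, t, pixel, depth, (0.0, 0.0, 0.0))
moving_err = misalignment(intrinsics, R, t, pixel, depth, (0.5, 0.0, 0.0))
print(f"static point: {static_err:.3f} px, moving point: {moving_err:.3f} px")
```

Under these assumed values, a static point reprojects with zero error, while a point that shifts 0.5 m sideways between frames lands roughly 19 pixels away from where the warp expects it. Features sampled at the predicted location then belong to a different surface, which is the kind of misaligned cost the abstract describes.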