This work aims to estimate a high-quality depth map from a single RGB image. Because a single image offers few depth cues, making full use of both long-range correlations and local information is critical for accurate depth estimation. To this end, we introduce uncertainty-rectified cross-distillation between a Transformer and a convolutional neural network (CNN) to learn a unified depth estimator. Specifically, we use the depth estimates from the Transformer branch and the CNN branch as pseudo labels so that each branch teaches the other. Meanwhile, we model the pixel-wise depth uncertainty to rectify the loss weights of noisy pseudo labels. To prevent the large capacity gap between the strong Transformer branch and the weaker CNN branch from degrading the cross-distillation, we transfer feature maps from the Transformer to the CNN and design coupling units that help the CNN branch exploit the transferred features. Furthermore, we propose a simple yet effective data augmentation technique, CutFlip, which forces the model to exploit cues other than the vertical image position for depth inference. Extensive experiments demonstrate that our model, termed~\textbf{URCDC-Depth}, outperforms previous state-of-the-art methods on the KITTI, NYU-Depth-v2 and SUN RGB-D datasets without adding any computational burden at inference time. The source code is publicly available at \url{https://github.com/ShuweiShao/URCDC-Depth}.
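The abstract does not spell out how CutFlip operates. One plausible reading, assumed here, is that the image is cut at a random row and its upper and lower parts are swapped, with the same swap applied to the depth map so the labels stay aligned. The sketch below illustrates that reading on plain row lists; the function name `cut_flip` and the parameters `min_frac` and `max_frac` are hypothetical and are not taken from the released code.

```python
import random


def cut_flip(image_rows, depth_rows, rng=random, min_frac=0.2, max_frac=0.8):
    """Swap the upper and lower parts of an image/depth pair at a random row.

    Assumption: this is an illustrative reading of CutFlip, not the authors'
    implementation. `image_rows` and `depth_rows` are sequences of rows with
    the same height; `min_frac` and `max_frac` bound the cut position.
    """
    height = len(image_rows)
    if len(depth_rows) != height:
        raise ValueError("image and depth map must have the same number of rows")

    # Keep the cut away from the borders so both parts are non-empty.
    lo = max(1, int(height * min_frac))
    hi = min(height - 1, int(height * max_frac))
    if lo > hi:
        # Too few rows to cut; return unchanged copies.
        return list(image_rows), list(depth_rows), 0

    cut = rng.randint(lo, hi)  # inclusive on both ends

    # The lower part moves to the top and the upper part to the bottom, so a
    # pixel's row index no longer predicts its depth on its own.
    flipped_image = list(image_rows[cut:]) + list(image_rows[:cut])
    flipped_depth = list(depth_rows[cut:]) + list(depth_rows[:cut])
    return flipped_image, flipped_depth, cut
```

In a training pipeline this transform would typically be applied with some probability per sample, and only during training, which is consistent with the claim that inference cost is unchanged.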