Recently, the performance of monocular depth estimation (MDE) has been significantly boosted by the integration of transformer models. However, transformer models are usually computationally expensive, and their effectiveness in lightweight models is limited compared to convolutions. This limitation hinders their deployment on resource-limited devices. In this paper, we propose a cross-architecture knowledge distillation method for MDE, dubbed DisDepth, to enhance efficient CNN models with the supervision of state-of-the-art transformer models. Concretely, we first build a simple framework of convolution-based MDE, which is then enhanced with a novel local-global convolution module to capture both local and global information in the image. To effectively distill valuable information from the transformer teacher and bridge the gap between convolution and transformer features, we introduce a method to acclimate the teacher with a ghost decoder. The ghost decoder is a copy of the student's decoder, and adapting the teacher with the ghost decoder aligns its features to be student-friendly while preserving its original performance. Furthermore, we propose an attentive knowledge distillation loss that adaptively identifies features valuable for depth estimation. This loss guides the student to focus more on attentive regions, improving its performance. Extensive experiments on the KITTI and NYU Depth V2 datasets demonstrate the effectiveness of DisDepth. Our method achieves significant improvements on various efficient backbones, showcasing its potential for efficient monocular depth estimation.
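As a rough illustration of the attentive distillation idea described above, the sketch below weights the per-pixel feature-matching error by an attention map derived from the teacher's activation energy, so that the student is guided toward regions the teacher deems salient. This is a hypothetical, simplified formulation for intuition only; the function name, the energy-based attention, and the temperature `tau` are assumptions, not the paper's exact loss.

```python
import numpy as np

def attentive_kd_loss(student_feat, teacher_feat, tau=1.0):
    """Hypothetical sketch of an attentive feature-distillation loss.

    student_feat, teacher_feat: arrays of shape (C, H, W), assumed to be
    spatially aligned (e.g., after a projection layer on the student side).
    """
    # Teacher activation energy per pixel: a simple proxy for how
    # "valuable" each region is for the task.
    energy = (teacher_feat ** 2).mean(axis=0)            # (H, W)

    # Softmax over all pixels turns energy into an attention map
    # that sums to 1; tau controls its sharpness.
    logits = energy.ravel() / tau
    attn = np.exp(logits - logits.max())
    attn = (attn / attn.sum()).reshape(energy.shape)     # (H, W)

    # Per-pixel mean-squared feature error between student and teacher.
    err = ((student_feat - teacher_feat) ** 2).mean(axis=0)

    # Attention-weighted distillation loss: salient regions dominate.
    return float((attn * err).sum())
```

In practice such a loss would be computed per feature stage on batched tensors and combined with the ordinary depth-regression loss; the sketch keeps a single unbatched feature map for clarity.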