Self-supervised monocular depth estimation, which requires no ground truth for training, has attracted considerable attention in recent years. Lightweight yet effective models are of particular interest because they can be deployed on edge devices. Many existing architectures gain accuracy from heavier backbones at the expense of model size. This paper achieves comparable results with a lightweight architecture. Specifically, it investigates the efficient combination of CNNs and Transformers and presents a hybrid architecture called Lite-Mono. Two modules are proposed: a Consecutive Dilated Convolutions (CDC) module and a Local-Global Features Interaction (LGFI) module. The former extracts rich multi-scale local features, and the latter uses the self-attention mechanism to encode long-range global information into the features. Experiments demonstrate that Lite-Mono outperforms Monodepth2 by a large margin in accuracy, with about 80% fewer trainable parameters.
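To make the idea behind consecutive dilated convolutions concrete, the following is a minimal one-dimensional sketch in plain Python. It is not the Lite-Mono implementation: the function names, the all-ones kernel, and the dilation rates (1, 2, 3) are assumptions chosen for illustration. It shows only the general property that stacking convolutions with increasing dilation widens the receptive field without adding weights, which is one way to gather multi-scale local context cheaply.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, same-length 1D dilated convolution (cross-correlation form)."""
    k = len(kernel)
    center = (k - 1) // 2  # assumes an odd kernel size
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            # Taps are spaced `dilation` samples apart around position i.
            idx = i + (j - center) * dilation
            if 0 <= idx < n:
                acc += w * signal[idx]
        out.append(acc)
    return out


def consecutive_dilated(signal, kernel, dilations):
    """Apply the same kernel repeatedly, once per dilation rate (hypothetical helper)."""
    for d in dilations:
        signal = dilated_conv1d(signal, kernel, d)
    return signal


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Illustrative settings only; the dilation rates are an assumption.
dilations = [1, 2, 3]
impulse = [0.0] * 21
impulse[10] = 1.0

response = consecutive_dilated(impulse, [1.0, 1.0, 1.0], dilations)
support = sum(1 for v in response if v != 0.0)

print(receptive_field(3, dilations))
print(support)
```

With these settings both printed values are 13: a single impulse spreads over 13 positions after three 3-tap layers, whereas three undilated 3-tap layers would cover only 7. The number of weights per layer is unchanged, which is why dilation is attractive for a lightweight model.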