Self-supervised monocular depth estimation has emerged as a promising approach since it does not rely on labeled training data. Most methods combine convolutions with Transformers to model long-range dependencies for accurate depth estimation. However, a Transformer treats 2D image features as a 1D sequence; positional encoding only partially compensates for the resulting loss of spatial information between feature patches, and the attention tends to overlook channel features, which limits the performance of depth estimation. In this paper, we propose a self-supervised monocular depth estimation network that recovers finer details. Specifically, we design a decoder based on large kernel attention, which models long-range dependencies without destroying the two-dimensional structure of the features while preserving channel adaptivity. In addition, we introduce an up-sampling module to accurately recover fine details in the depth map. Our method achieves competitive results on the KITTI dataset.
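To make the large-kernel-attention idea concrete, the sketch below shows the standard decomposition of a large spatial kernel into a depthwise convolution, a dilated depthwise convolution, and a pointwise (1x1) convolution, with the result used as an elementwise attention map over the input. This is a minimal NumPy illustration, not the paper's implementation; the kernel sizes (5x5 depthwise, 7x7 dilated with dilation 3) and random weights are illustrative assumptions. Because the attention map stays a `(C, H, W)` tensor, the 2D layout is never flattened, and the 1x1 channel mixing gives per-channel adaptivity.

```python
import numpy as np

def depthwise_conv2d(x, k, dilation=1):
    """Per-channel 2D convolution with 'same' zero padding.
    x: (C, H, W); k: (C, kh, kw) -- one kernel per channel."""
    C, H, W = x.shape
    kh, kw = k.shape[1:]
    ph = dilation * (kh - 1) // 2
    pw = dilation * (kw - 1) // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(kh):          # accumulate one shifted slice per tap
        for j in range(kw):
            out += k[:, i:i+1, j:j+1] * \
                   xp[:, i*dilation:i*dilation+H, j*dilation:j*dilation+W]
    return out

def large_kernel_attention(x, rng):
    """LKA-style decomposition (illustrative kernel sizes, random weights):
    depthwise 5x5 -> dilated depthwise 7x7 (dilation 3) -> 1x1 conv,
    then apply the result as an elementwise attention map."""
    C = x.shape[0]
    k_dw = rng.standard_normal((C, 5, 5)) * 0.1   # local depthwise kernel
    k_dwd = rng.standard_normal((C, 7, 7)) * 0.1  # dilated depthwise kernel
    w_pw = rng.standard_normal((C, C)) * 0.1      # 1x1 conv = channel mixing
    a = depthwise_conv2d(x, k_dw)
    a = depthwise_conv2d(a, k_dwd, dilation=3)    # enlarges receptive field
    a = np.einsum('oc,chw->ohw', w_pw, a)         # pointwise convolution
    return a * x                                  # attention: elementwise gate
```

The dilated stage is what makes the effective receptive field large (comparable to a 21x21 kernel here) at the cost of only 7x7 weights per channel, while the 2D feature map shape is preserved end to end.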