Exploiting mutually beneficial cross-domain information has shown great potential for accurate self-supervised depth estimation. In this work, we revisit feature fusion between depth and semantic information and propose an efficient local adaptive attention method for geometry-aware representation enhancement. Instead of building global connections or deforming attention across the feature space without restraint, we bound the spatial interaction within a learnable region of interest. In particular, we leverage geometric cues from semantic information to learn local adaptive bounding boxes that guide unsupervised feature aggregation. These local regions exclude most irrelevant reference points from the attention space, yielding more selective feature learning and faster convergence. We naturally extend the paradigm to a multi-head, hierarchical design that distills information at different semantic levels and improves feature discriminability for fine-grained depth estimation. Extensive experiments on the KITTI dataset show that the proposed method establishes a new state of the art in the self-supervised monocular depth estimation task, demonstrating its effectiveness over previous Transformer variants.
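The idea of bounding attention within a learnable box can be illustrated with a minimal single-query, single-head sketch in plain Python. It is not the authors' implementation: the box head `predict_box`, its weight matrix `w_box`, the `grid` of evenly spaced reference points, the `max_half` size limit, and the choice of using the depth feature as the query and sampled semantic features as keys and values are all assumptions made for illustration.

```python
import math


def bilinear(feat, x, y):
    """Sample an H x W x C feature map (nested lists) at a continuous (x, y)."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the map so sampling stays valid
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    return [
        (1 - ay) * ((1 - ax) * feat[y0][x0][i] + ax * feat[y0][x1][i])
        + ay * ((1 - ax) * feat[y1][x0][i] + ax * feat[y1][x1][i])
        for i in range(len(feat[0][0]))
    ]


def predict_box(sem_vec, qx, qy, w_box, max_half):
    """Hypothetical box head: a linear map of the semantic feature at the
    query gives a center offset and a half-width/half-height (assumption)."""
    raw = [sum(w * s for w, s in zip(row, sem_vec)) for row in w_box]
    cx = qx + max_half * math.tanh(raw[0])
    cy = qy + max_half * math.tanh(raw[1])
    half_w = max_half / (1.0 + math.exp(-raw[2]))  # sigmoid keeps size positive
    half_h = max_half / (1.0 + math.exp(-raw[3]))
    return cx, cy, half_w, half_h


def local_adaptive_attention(depth_feat, sem_feat, qx, qy, w_box,
                             grid=3, max_half=2.0):
    """Attend from one depth query to grid x grid reference points inside a
    box predicted from semantic features. Requires grid >= 2 and equal
    channel counts for both feature maps."""
    query = depth_feat[qy][qx]
    cx, cy, hw, hh = predict_box(sem_feat[qy][qx], qx, qy, w_box, max_half)

    # Reference points are evenly spaced inside the box, so everything
    # outside the box is excluded from the attention space.
    steps = [-1.0 + 2.0 * k / (grid - 1) for k in range(grid)]
    points = [(cx + sx * hw, cy + sy * hh) for sy in steps for sx in steps]
    keys = [bilinear(sem_feat, x, y) for x, y in points]

    # Scaled dot-product scores followed by a numerically stable softmax.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Keys double as values here; the output is their weighted sum.
    out = [sum(wt * key[i] for wt, key in zip(weights, keys))
           for i in range(len(query))]
    return out, weights, points
```

With a 3 x 3 grid, each query attends to 9 reference points regardless of the feature map size, whereas global attention would attend to all H x W positions. A multi-head variant could run this with a separate `w_box` per head, and a hierarchical one could repeat it on feature maps of different resolutions.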