RGB-D has gradually become a crucial data source for understanding complex scenes in assisted driving. However, existing studies have paid insufficient attention to the intrinsic spatial properties of depth maps. This oversight significantly degrades attention representations, leading to prediction errors caused by attention-shift issues. To this end, we propose a novel learnable Depth interaction Pyramid Transformer (DiPFormer) to explore the effectiveness of depth. First, we introduce Depth Spatial-Aware Optimization (Depth SAO), which serves as an offset to represent real-world spatial relationships. Second, the similarity of RGB-D features in the feature space is learned by Depth Linear Cross-Attention (Depth LCA) to clarify spatial differences at the pixel level. Finally, an MLP decoder is employed to effectively fuse multi-scale features and meet real-time requirements. Comprehensive experiments demonstrate that the proposed DiPFormer significantly mitigates attention misalignment in both road detection (+7.5%) and semantic segmentation (+4.9% / +1.5%) tasks. DiPFormer achieves state-of-the-art performance on the KITTI (97.57% F-score on KITTI Road and 68.74% mIoU on KITTI-360) and Cityscapes (83.4% mIoU) datasets.