Depth estimation provides an alternative approach for perceiving 3D information in autonomous driving. Monocular depth estimation, whether with single-frame or multi-frame inputs, has achieved significant success by learning various types of cues, with each approach specializing in either static or dynamic scenes. Recently, fusing these cues has become an attractive topic, with the aim of making the combined cues perform well in both types of scenes. However, adaptive cue fusion relies on attention mechanisms, whose quadratic complexity limits the granularity of cue representation. Additionally, explicit cue fusion depends on precise segmentation, which places a heavy burden on mask prediction. To address these issues, we propose the GSDC Transformer, an efficient and effective component for cue fusion in monocular multi-frame depth estimation. We use deformable attention to learn cue relationships at a fine scale, while sparse attention reduces the computational requirements as granularity increases. To compensate for the precision drop in dynamic scenes, we represent scene attributes as super tokens, which do not rely on precise shapes. Within each super token attributed to dynamic scenes, we gather its relevant cues and learn local dense relationships to enhance cue fusion. Our method achieves state-of-the-art performance on the KITTI dataset with efficient fusion speed.
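To make the complexity argument concrete, the following is a minimal sketch that counts query-key interactions for dense attention versus attention that samples a fixed number of locations per query, as deformable attention does. It is not the GSDC Transformer itself: the function names, the feature-map resolutions, and the default `points_per_query=8` are illustrative assumptions, not values taken from the paper.

```python
def dense_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every token attends to every other token."""
    n = height * width
    return n * n


def sampled_attention_pairs(height: int, width: int, points_per_query: int) -> int:
    """Query-key pairs when each token attends to a fixed number of sampled points."""
    n = height * width
    return n * points_per_query


def cost_table(resolutions, points_per_query=8):
    """Return (height, width, dense_pairs, sampled_pairs, ratio) per resolution.

    The resolutions and points_per_query are hypothetical, chosen only to
    show how the two costs scale as the cue representation gets finer.
    """
    rows = []
    for h, w in resolutions:
        dense = dense_attention_pairs(h, w)
        sampled = sampled_attention_pairs(h, w, points_per_query)
        rows.append((h, w, dense, sampled, dense // sampled))
    return rows


if __name__ == "__main__":
    # Doubling the resolution quadruples the token count, so dense cost
    # grows 16x per step while the sampled cost grows only 4x.
    for h, w, dense, sampled, ratio in cost_table([(24, 80), (48, 160), (96, 320)]):
        print(f"{h}x{w}: dense={dense:,} sampled={sampled:,} ratio={ratio}x")
```

Under these assumptions the gap between the two counts widens linearly with the number of tokens, which is why a fixed sampling budget per query allows a finer cue representation at a given compute budget.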