We present DFormer, a novel RGB-D pretraining framework that learns transferable representations for RGB-D segmentation tasks. DFormer has two key innovations: 1) Unlike previous works that encode RGB-D information with RGB-pretrained backbones, we pretrain the backbone on image-depth pairs from ImageNet-1K, which endows DFormer with the capacity to encode RGB-D representations; 2) DFormer comprises a sequence of RGB-D blocks, a novel building block design tailored for encoding both RGB and depth information. DFormer thereby avoids the mismatched encoding of 3D geometric relationships in depth maps by RGB-pretrained backbones, a problem that is widespread in existing methods but has not been resolved. We finetune the pretrained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results on two RGB-D semantic segmentation datasets and five RGB-D salient object detection datasets show that DFormer achieves new state-of-the-art performance on both tasks with less than half the computational cost of the current best methods. Our code is available at: https://github.com/VCIP-RGBD/DFormer.