We present DFormer, a novel RGB-D pre-training framework that learns transferable representations for RGB-D segmentation tasks. DFormer has two key innovations: 1) Unlike previous works that aim to encode RGB features, DFormer comprises a sequence of RGB-D blocks, which are tailored for encoding both RGB and depth information through a novel building block design; 2) We pre-train the backbone on image-depth pairs from ImageNet-1K, so that DFormer is endowed with the capacity to encode RGB-D representations. This avoids the mismatched encoding of the 3D geometric relationships in depth maps by RGB pre-trained backbones, a problem that is widespread in existing methods but has not been resolved. We fine-tune the pre-trained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results on two RGB-D segmentation datasets and five RGB-D saliency datasets show that DFormer achieves new state-of-the-art performance on both tasks at less than half the computational cost of the current best methods. Our code is available at: https://github.com/VCIP-RGBD/DFormer.