With the growing availability of depth sensors, multimodal frameworks that combine color information with depth data are attracting increasing interest. In the challenging task of semantic segmentation, depth maps make it possible to distinguish between similarly colored objects at different depths and provide useful geometric cues. On the other hand, ground truth data for semantic segmentation is burdensome to obtain, which makes domain adaptation another significant research area. Specifically, we address the challenging source-free domain adaptation setting, where adaptation is performed without reusing source data. We propose MISFIT: MultImodal Source-Free Information fusion Transformer, a depth-aware framework that injects depth information into a segmentation module based on vision transformers at multiple stages, namely at the input, feature and output levels. Color and depth style transfer helps early-stage domain alignment, while re-wiring self-attention between modalities creates mixed features that allow the extraction of better semantic content. Furthermore, a depth-based entropy minimization strategy is proposed to adaptively weight regions at different distances. Our framework, which is also the first approach using vision transformers for source-free semantic segmentation, shows noticeable performance improvements with respect to standard strategies.
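To make the idea of depth-based entropy minimization more concrete, the following is a minimal NumPy sketch of one way per-pixel prediction entropy could be weighted by distance. It is an illustration under stated assumptions, not the formulation used in MISFIT: the function names, the linear weighting `1 + alpha * normalized_depth`, and the `alpha` parameter are all hypothetical choices, since the abstract does not specify how regions at different distances are weighted.

```python
import numpy as np


def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def depth_weighted_entropy(logits, depth, alpha=1.0, eps=1e-8):
    """Weighted mean of per-pixel prediction entropy.

    logits: array of shape (C, H, W) with unnormalized class scores.
    depth:  array of shape (H, W) with per-pixel distances.
    alpha:  hypothetical knob; alpha > 0 emphasizes far pixels,
            alpha = 0 reduces to plain mean entropy.
    """
    probs = softmax(logits, axis=0)                        # (C, H, W)
    entropy = -(probs * np.log(probs + eps)).sum(axis=0)   # (H, W)

    # Assumed weighting: linear in min-max normalized depth.
    span = depth.max() - depth.min()
    d_norm = (depth - depth.min()) / (span + eps)
    weights = 1.0 + alpha * d_norm
    weights = weights / weights.sum()                      # weights sum to 1

    return float((weights * entropy).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(5, 4, 6))        # 5 classes on a 4x6 image
    depth = rng.uniform(0.5, 10.0, size=(4, 6))
    print(depth_weighted_entropy(logits, depth, alpha=1.0))
```

In a source-free adaptation loop, a quantity of this kind would be minimized on unlabeled target images so that the model becomes more confident in the regions the weighting emphasizes; whether near or far regions should receive more weight is a design decision this sketch leaves to `alpha`.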