The recent Segment Anything Model (SAM) represents a significant breakthrough in scaling segmentation models, delivering strong performance across various downstream applications in the RGB modality. However, directly applying SAM to emerging visual modalities, such as depth and event data, yields suboptimal performance in multi-modal segmentation tasks. In this paper, we make the first attempt to adapt SAM for multi-modal semantic segmentation by proposing a Mixture of Low-Rank Adaptation Experts (MoE-LoRA) tailored to different input visual modalities. By training only the MoE-LoRA layers while keeping SAM's weights frozen, we preserve SAM's strong generalization and segmentation capabilities for downstream tasks. Specifically, to address cross-modal inconsistencies, we propose a novel MoE routing strategy that adaptively generates weighted features across modalities, enhancing multi-modal feature integration. Additionally, we incorporate multi-scale feature extraction and fusion by adapting SAM's segmentation head and introducing an auxiliary segmentation head that effectively combines multi-scale features for improved segmentation performance. Extensive experiments on three multi-modal benchmarks, DELIVER, MUSES, and MCubeS, consistently demonstrate that the proposed method significantly outperforms state-of-the-art approaches across diverse scenarios. Notably, under the particularly challenging condition of missing modalities, our approach achieves a substantial performance gain of 32.15% over existing methods.
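The core mechanism described above, a frozen base weight augmented by several low-rank expert updates blended through a learned router, can be illustrated with a minimal numpy sketch. All shapes, variable names, and the exact routing function here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

# Hypothetical sketch of one Mixture-of-LoRA-Experts layer.
# W stands in for a frozen SAM projection weight; each expert k
# contributes a low-rank update B_k @ A_k, and a router produces
# softmax gate weights over experts from the input features.

rng = np.random.default_rng(0)
d, r, num_experts = 16, 4, 3            # feature dim, LoRA rank, expert count

W = rng.standard_normal((d, d))          # frozen SAM weight (never trained)
A = rng.standard_normal((num_experts, r, d)) * 0.01  # trainable down-projections
B = np.zeros((num_experts, d, r))        # trainable up-projections, zero-init as in LoRA
W_router = rng.standard_normal((num_experts, d)) * 0.01  # trainable router

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def moe_lora_forward(x):
    """y = W x + sum_k g_k(x) * (B_k A_k) x, with g = softmax(router(x))."""
    gates = softmax(W_router @ x)        # (num_experts,) weights over experts
    delta = sum(g * (Bk @ (Ak @ x)) for g, Ak, Bk in zip(gates, A, B))
    return W @ x + delta, gates

x = rng.standard_normal(d)
y, gates = moe_lora_forward(x)
# With B zero-initialized, the layer initially reproduces the frozen SAM path,
# so adaptation starts from SAM's pretrained behavior.
assert np.allclose(y, W @ x)
assert np.isclose(gates.sum(), 1.0)
```

Zero-initializing the up-projections `B` is the standard LoRA convention: the adapted layer starts identical to the frozen model, and only gradient updates to `A`, `B`, and the router move it away from that initialization.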