Referring image segmentation (RIS) is a fundamental vision-language task that aims to segment the target object from an image based on a given natural language expression. Because images and text have fundamentally different data properties, most existing methods either introduce complex designs to achieve fine-grained vision-language alignment or lack the required dense alignment, resulting in scalability issues or mis-segmentation problems such as over- or under-segmentation. To achieve effective and efficient fine-grained feature alignment in the RIS task, we explore the potential of masked multimodal modeling coupled with self-distillation and propose a novel cross-modality masked self-distillation framework named CM-MaskSD. Our method inherits the image-text semantic alignment knowledge transferred from the CLIP model to realize fine-grained patch-word feature alignment for better segmentation accuracy. Moreover, CM-MaskSD boosts model performance considerably in a nearly parameter-free manner: it shares weights between the main segmentation branch and the introduced masked self-distillation branches, and introduces only negligible parameters for coordinating the multimodal features. Comprehensive experiments on three benchmark datasets for the RIS task (RefCOCO, RefCOCO+, and G-Ref) demonstrate the superiority of the proposed framework over previous state-of-the-art methods.
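The weight-sharing idea described above can be illustrated with a minimal NumPy sketch. It is not the CM-MaskSD architecture: the linear projections, the zero-masking strategy, the mean-squared-error distillation loss, and all names (`encode`, `random_mask`, `masked_self_distillation_loss`) are assumptions chosen for illustration, and no CLIP weights are involved. The sketch only shows how a main branch and two masked branches can reuse the same parameters, so that the distillation signal adds no new weights.

```python
import numpy as np


def encode(patches, words, w_img, w_txt, eps=1e-8):
    """Toy shared encoder (hypothetical): project both modalities and
    return the patch-word cosine-similarity map."""
    p = patches @ w_img                      # (num_patches, dim)
    t = words @ w_txt                        # (num_words, dim)
    p = p / (np.linalg.norm(p, axis=1, keepdims=True) + eps)
    t = t / (np.linalg.norm(t, axis=1, keepdims=True) + eps)
    return p @ t.T                           # (num_patches, num_words)


def random_mask(tokens, ratio, rng):
    """Zero out a random subset of token rows (assumed masking strategy).
    Returns the masked copy and a boolean mask of the rows that were kept."""
    n = tokens.shape[0]
    num_masked = int(round(ratio * n))
    keep = np.ones(n, dtype=bool)
    keep[rng.permutation(n)[:num_masked]] = False
    return tokens * keep[:, None], keep


def masked_self_distillation_loss(patches, words, w_img, w_txt, ratio, rng):
    """Compare the full-input branch with a masked-image branch and a
    masked-text branch. All three branches use the same w_img and w_txt."""
    teacher = encode(patches, words, w_img, w_txt)   # main branch
    masked_patches, _ = random_mask(patches, ratio, rng)
    masked_words, _ = random_mask(words, ratio, rng)
    student_img = encode(masked_patches, words, w_img, w_txt)
    student_txt = encode(patches, masked_words, w_img, w_txt)
    # Assumed distillation objective: mean squared error to the main branch.
    loss_img = np.mean((teacher - student_img) ** 2)
    loss_txt = np.mean((teacher - student_txt) ** 2)
    return float(loss_img + loss_txt)


rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 32))   # 16 image patches, 32-d features
words = rng.normal(size=(5, 24))      # 5 word tokens, 24-d features
w_img = rng.normal(size=(32, 8))      # shared image projection
w_txt = rng.normal(size=(24, 8))      # shared text projection

print(masked_self_distillation_loss(patches, words, w_img, w_txt, 0.5, rng))
```

In a real training setup the main-branch output would typically be treated as a fixed target (no gradient) and the loss would be added to the segmentation objective; both details are left out here because the abstract does not specify them.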