Recent CLIP-based few-shot semantic segmentation methods introduce class-level textual priors to assist segmentation, typically using a single prompt (e.g., "a photo of a [class]"). However, these approaches often result in incomplete activation of target regions, as a single textual description cannot fully capture the semantic diversity of complex categories. Moreover, they lack explicit cross-modal interaction and are vulnerable to noisy support features, further degrading visual prior quality. To address these issues, we propose the Multi-Text Guided Few-Shot Semantic Segmentation Network (MTGNet), a dual-branch framework that enhances segmentation performance by fusing diverse textual prompts to refine textual priors and guide the cross-modal optimization of visual priors. Specifically, we design a Multi-Textual Prior Refinement (MTPR) module that suppresses interference and aggregates complementary semantic cues to enhance foreground activation and expand semantic coverage for structurally complex objects. We introduce a Text Anchor Feature Fusion (TAFF) module, which leverages multi-text embeddings as semantic anchors to facilitate the transfer of discriminative local prototypes from support images to query images, thereby improving semantic consistency and alleviating intra-class variation. Furthermore, a Foreground Confidence-Weighted Attention (FCWA) module is presented to enhance visual prior robustness by leveraging internal self-similarity within support foreground features. It adaptively down-weights inconsistent regions and effectively suppresses interference in the query segmentation process. Extensive experiments on standard FSS benchmarks validate the effectiveness of MTGNet. In the 1-shot setting, it achieves 76.8% mIoU on PASCAL-5^i and 57.4% on COCO-20^i, with notable improvements on folds exhibiting high intra-class variation.
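To make the core idea concrete, the sketch below illustrates one plausible way to fuse multiple text prompts into a single foreground prior map, in the spirit of the MTPR module described above. It is a minimal illustration, not the paper's actual implementation: the function name `multi_text_prior`, the peak-response weighting scheme, and the random tensors standing in for dense CLIP visual features and prompt embeddings are all assumptions for demonstration.

```python
import torch
import torch.nn.functional as F

def multi_text_prior(visual_feats, text_embs):
    """Fuse K prompt-wise similarity maps into one prior (hypothetical sketch).

    visual_feats: (C, H, W) dense image features, e.g. from a CLIP visual
                  encoder with global pooling removed.
    text_embs:    (K, C) embeddings of K prompt variants for one class,
                  e.g. "a photo of a {cls}", "a close-up of a {cls}", ...
    returns:      (H, W) fused foreground prior, min-max normalized to [0, 1].
    """
    C, H, W = visual_feats.shape
    v = F.normalize(visual_feats.flatten(1), dim=0)   # (C, H*W), unit vectors per location
    t = F.normalize(text_embs, dim=1)                 # (K, C), unit vectors per prompt

    sims = (t @ v).view(-1, H, W)                     # (K, H, W) cosine similarity maps

    # Weight each prompt by its peak response so weak or off-target prompts
    # contribute less -- one simple way to "suppress interference" while
    # aggregating complementary cues.
    weights = torch.softmax(sims.amax(dim=(1, 2)), dim=0)   # (K,)
    fused = (weights.view(-1, 1, 1) * sims).sum(dim=0)      # (H, W)

    # Normalize to [0, 1] so the map can serve as a prior for the decoder.
    return (fused - fused.min()) / (fused.max() - fused.min() + 1e-6)

# Toy usage with random tensors in place of real CLIP features.
prior = multi_text_prior(torch.randn(512, 32, 32), torch.randn(5, 512))
print(prior.shape)  # torch.Size([32, 32])
```

A single-prompt baseline corresponds to K = 1, where the softmax weight collapses to 1; the benefit of multiple prompts comes from covering distinct appearances of a class (viewpoints, parts, contexts) that no single description activates on its own.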


