Few-shot learning aims to recognize novel concepts from only a few samples by leveraging prior knowledge. However, for visually intensive tasks such as few-shot semantic segmentation, pixel-level annotations are time-consuming and costly. Therefore, in this paper, we utilize the more challenging image-level annotations and propose an adaptive frequency-aware network (AFANet) for weakly-supervised few-shot semantic segmentation (WFSS). Specifically, we first propose a cross-granularity frequency-aware module (CFM) that decouples RGB images into high-frequency and low-frequency distributions and further optimizes semantic structural information by realigning them. Unlike most existing WFSS methods, which use textual information from a multi-modal vision-language model, e.g., CLIP, in an offline learning manner, we further propose a CLIP-guided spatial-adapter module (CSM), which performs spatial-domain adaptive transformation on textual information through online learning, thus providing enriched cross-modal semantic information for CFM. Extensive experiments on the Pascal-5\textsuperscript{i} and COCO-20\textsuperscript{i} datasets demonstrate that AFANet achieves state-of-the-art performance. The code is available at https://github.com/jarch-ma/AFANet.
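The high-/low-frequency decoupling that CFM performs on RGB images can be illustrated with a minimal FFT-based sketch. This is not the paper's actual CFM implementation; the circular low-pass mask and the `radius_ratio` cutoff below are illustrative assumptions.

```python
import numpy as np

def frequency_decouple(image, radius_ratio=0.25):
    """Split a single-channel image into low- and high-frequency parts
    using a centered 2-D FFT and a circular low-pass mask.

    `radius_ratio` (a hypothetical knob, not from the paper) sets the
    cutoff radius as a fraction of the smaller spatial dimension.
    """
    h, w = image.shape
    # Shift the spectrum so the DC (zero-frequency) component is centered.
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Boolean low-pass mask: True inside the cutoff circle.
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    low_mask = dist <= radius_ratio * min(h, w)

    # Inverse-transform each masked spectrum back to the spatial domain.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high

img = np.random.rand(64, 64)
low, high = frequency_decouple(img)
# Because the two masks are complementary, low + high reconstructs
# the original image up to floating-point error.
```

In a CFM-style module, the two branches would then be processed and realigned by learned layers; here they simply partition the spectrum, so their sum recovers the input exactly.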