In this work, we address few-shot part segmentation, which aims to segment the different parts of an unseen object using very few labeled examples. Leveraging the textual space of a powerful pre-trained image-language model, such as CLIP, has been found to benefit the learning of visual features. We therefore develop PartSeg, a novel method for few-shot part segmentation based on multimodal learning. Specifically, we design a part-aware prompt learning method that generates part-specific prompts, enabling the CLIP model to better understand the concept of ``part'' and to fully utilize its textual space. Furthermore, since the concept of a given part is general across object categories, we establish relationships between these parts during prompt learning. We conduct extensive experiments on the PartImageNet and Pascal\_Part datasets, and the results demonstrate that our proposed method achieves state-of-the-art performance.
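To make the idea of part-specific prompts shared across object categories more concrete, the following is a minimal toy sketch and not the PartSeg implementation. It assumes that a prompt for an (object, part) pair can be composed from a shared context vector, a part embedding reused across all objects, and an object embedding, and that each pixel feature is assigned to the part whose prompt has the highest cosine similarity. All names (`build_part_prompts`, `segment`, the 3-dimensional embeddings) are hypothetical; in practice the prompt vectors would come from learned tokens passed through CLIP's text encoder and the pixel features from its image encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_part_prompts(context, part_embeddings, object_embedding):
    """Compose one prompt vector per part (assumed additive composition).

    The same `part_embeddings` are reused for every object category, which is
    a toy stand-in for relating the same part across different objects.
    """
    return {
        part: [c + p + o for c, p, o in zip(context, emb, object_embedding)]
        for part, emb in part_embeddings.items()
    }


def segment(pixel_features, prompts):
    """Label each pixel with the part whose prompt is most similar to it."""
    return [
        [max(prompts, key=lambda part: cosine(feat, prompts[part])) for feat in row]
        for row in pixel_features
    ]


# Hypothetical 3-d embeddings; real ones would be high-dimensional and learned.
context = [0.1, 0.1, 0.1]
part_embeddings = {"head": [1.0, 0.0, 0.0], "body": [0.0, 1.0, 0.0]}
dog = [0.0, 0.0, 0.2]
cat = [0.0, 0.0, -0.2]

# A 1x2 "image" of pixel features.
pixels = [[[0.9, 0.1, 0.2], [0.1, 0.8, 0.3]]]

dog_labels = segment(pixels, build_part_prompts(context, part_embeddings, dog))
cat_labels = segment(pixels, build_part_prompts(context, part_embeddings, cat))
print(dog_labels, cat_labels)
```

Because the part embeddings are shared, the "head" and "body" prompts for both toy objects stay close to their respective part directions, so the same pixels receive the same part labels under either object category.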