Recently, CLIP has found practical utility in pixel-level zero-shot segmentation. Existing two-stage methods suffer from intricate pipelines and high computational cost. One-stage approaches alleviate these concerns and adopt Visual Prompt Tuning (VPT) to preserve CLIP's generalization capacity, yet they still fall short of fully exploiting CLIP's potential for delineating unseen classes at the pixel level and for producing precise pixel predictions. To further stimulate CLIP's zero-shot dense prediction capability, we propose SPT-SEG, a one-stage approach that improves CLIP's adaptation from image to pixel. Specifically, we first introduce Spectral Prompt Tuning (SPT), which injects spectral prompts into the shallow layers of the CLIP visual encoder to capture the structural intricacies of images, thereby enhancing comprehension of unseen classes. We then introduce the Spectral Guided Decoder (SGD), which exploits both high- and low-frequency information to steer the network's spatial focus toward more salient classification features, yielding precise pixel-level predictions. Extensive experiments on two public datasets demonstrate the superiority of our method over state-of-the-art approaches: it performs well across all classes and particularly excels on unseen classes. Code is available at: https://github.com/clearxu/SPT.
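The frequency decomposition underlying the Spectral Guided Decoder can be illustrated with a minimal sketch. This is a hypothetical implementation assuming a simple FFT low-pass mask to separate low-frequency (coarse structure) from high-frequency (edge/detail) components of a feature map; the paper's actual SGD filtering scheme may differ, and `spectral_split` and `radius_ratio` are illustrative names, not the authors' API.

```python
import numpy as np

def spectral_split(feat, radius_ratio=0.25):
    """Split a 2-D feature map into low- and high-frequency parts
    using a circular low-pass mask in the shifted FFT domain.
    Illustrative only; not the paper's exact formulation."""
    h, w = feat.shape
    spectrum = np.fft.fftshift(np.fft.fft2(feat))  # DC component moved to center
    yy, xx = np.ogrid[:h, :w]
    cy, cx = h // 2, w // 2
    radius = radius_ratio * min(h, w)
    low_mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    # Low-frequency part: keep only frequencies inside the mask
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    # High-frequency residual carries edges and fine detail
    high = feat - low
    return low, high

# The two bands partition the original signal exactly
feat = np.random.default_rng(0).random((16, 16))
low, high = spectral_split(feat)
assert np.allclose(low + high, feat)
```

In a decoder, the low band could guide attention toward coarse region-level semantics while the high band sharpens boundaries, matching the abstract's description of steering spatial focus with both frequency ranges.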