We address the task of weakly-supervised few-shot image classification and segmentation by leveraging a Vision Transformer (ViT) pretrained with self-supervision. Our proposed method takes token representations from the self-supervised ViT and exploits their correlations, via self-attention, to produce classification and segmentation predictions through separate task heads. The model learns to perform both tasks without pixel-level labels during training, using only image-level labels. To do this, it uses attention maps, created from tokens generated by the self-supervised ViT backbone, as pixel-level pseudo-labels. We also explore a practical setup with ``mixed'' supervision, where a small number of training images have ground-truth pixel-level labels and the remaining images have only image-level labels. For this mixed setup, we propose to improve the pseudo-labels using a pseudo-label enhancer trained on the available ground-truth pixel-level labels. Experiments on Pascal-5i and COCO-20i demonstrate significant performance gains in a variety of supervision settings, in particular when few or no pixel-level labels are available.
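As a rough illustration of how an attention map over ViT tokens could serve as a pixel-level pseudo-label, the following minimal sketch computes scaled dot-product attention from one query token to all patch tokens and binarizes the result on the patch grid. It is not the method's actual construction: the function name `attention_pseudo_label`, the use of a single query token, and the mean-value threshold are all assumptions made for this example, and the toy tokens stand in for real backbone outputs.

```python
import numpy as np


def attention_pseudo_label(query_token, patch_tokens, grid_hw, threshold=None):
    """Hypothetical helper: return (attention_map, binary_mask) on the patch grid."""
    h, w = grid_hw
    n, d = patch_tokens.shape
    if n != h * w:
        raise ValueError("patch_tokens must have h * w rows")
    # Scaled dot-product scores between the query token and every patch token.
    scores = patch_tokens @ query_token / np.sqrt(d)
    # Numerically stable softmax over the patches.
    scores = scores - scores.max()
    attn = np.exp(scores)
    attn = attn / attn.sum()
    attn_map = attn.reshape(h, w)
    # Assumed rule: patches with above-average attention count as foreground.
    if threshold is None:
        threshold = attn_map.mean()
    mask = (attn_map > threshold).astype(np.uint8)
    return attn_map, mask


# Toy tokens (not real ViT outputs): four central patches align with the query.
d, h, w = 16, 4, 4
query = np.ones(d) / np.sqrt(d)
patches = np.zeros((h * w, d))
patches[[5, 6, 9, 10]] = 3.0 * query
attn_map, mask = attention_pseudo_label(query, patches, (h, w))
print(mask)
```

In practice the coarse patch-grid mask would still need to be upsampled to image resolution before it could supervise a segmentation head; that step is omitted here.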