Large vision-and-language models learned directly from image-text associations often lack detailed visual grounding, while image segmentation is treated as a task separate from recognition, learned with supervision and without interconnection. Our key observation is that, while an image can be recognized in multiple ways, each way has a consistent part-and-whole visual organization. Segmentation should thus be treated not as an end task to be mastered through supervised learning, but as an internal process that evolves with and supports the ultimate goal of recognition. We propose to integrate a hierarchical segmenter into the recognition process, training and adapting the entire model solely on image-level recognition objectives. Hierarchical segmentation is thus learned for free alongside recognition, automatically uncovering part-to-whole relationships that not only underpin but also enhance recognition. By enhancing the Vision Transformer (ViT) with adaptive segment tokens and graph pooling, our model surpasses ViT in unsupervised part-whole discovery, semantic segmentation, image classification, and efficiency. Notably, our model (trained on 1M unlabeled ImageNet images) outperforms SAM (trained on 11M images and 1 billion masks) by an absolute 8% mIoU on PartImageNet object segmentation.