Slot attention aims to decompose an input image into a set of meaningful object files (slots). These latent object representations enable various downstream tasks. Yet, these slots often bind to object parts rather than objects themselves, especially on real-world datasets. To address this, we introduce Guided Latent Slot Diffusion (GLASS), an object-centric model that uses generated captions as a guiding signal to better align slots with objects. Our key insight is to learn the slot-attention module in the space of generated images. This allows us to repurpose the pre-trained diffusion decoder, which reconstructs images from slots, as a semantic mask generator conditioned on the generated captions. GLASS learns an object-level representation suitable for multiple tasks simultaneously, e.g., segmentation, image generation, and property prediction, outperforming previous methods. For object discovery, GLASS achieves approximately +35% and +10% relative improvements in mIoU over the previous state-of-the-art (SOTA) method on the VOC and COCO datasets, respectively, and establishes a new SOTA FID score for conditional image generation among slot-attention-based methods. For the segmentation task, GLASS surpasses SOTA weakly-supervised and language-based segmentation models that were specifically designed for this task.