Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy that aggregates the slots obtained from multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence, at only a minor inference overhead.
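To make the core idea concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of running simplified slot attention on features from several ViT layers and fusing the resulting slots. The slot-attention loop here omits learned projections, the GRU update, and the MLP refinement of the full method, and the fusion by simple averaging is an assumption chosen for brevity; the function names (`slot_attention`, `mufasa_sketch`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(feats, slots, iters=3):
    """Simplified slot attention: feats (N, D) from one ViT layer, slots (K, D)."""
    for _ in range(iters):
        # Softmax over slots: features compete for slot assignment.
        attn = softmax(feats @ slots.T / np.sqrt(feats.shape[1]), axis=1)  # (N, K)
        # Normalize per slot so each slot takes a weighted mean of its features.
        attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = attn.T @ feats  # (K, D) weighted-mean update
    return slots

def mufasa_sketch(layer_feats, num_slots=4, seed=0):
    """Run slot attention per ViT layer, then fuse the per-layer slots.

    layer_feats: list of (N, D) feature arrays, one per selected ViT layer.
    Fusion by element-wise averaging is an assumption for illustration.
    """
    rng = np.random.default_rng(seed)
    dim = layer_feats[0].shape[1]
    init_slots = rng.normal(size=(num_slots, dim))
    per_layer_slots = [slot_attention(f, init_slots.copy()) for f in layer_feats]
    return np.mean(per_layer_slots, axis=0)  # (K, D) fused slots
```

In this sketch, all layers share the same slot initialization so that slot `k` tends to bind to similar content in each layer, which makes the averaging-based fusion a coherent aggregation rather than mixing unrelated slots.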