Slot attention has shown remarkable performance in object-centric representation learning for computer vision tasks, without requiring any supervision. Although compositional modelling gives it the ability to bind features to individual objects, slot attention is a deterministic module and therefore cannot generate novel scenes. In this paper, we propose Slot-VAE, a generative model that integrates slot attention with the hierarchical VAE framework for object-centric structured scene generation. For each image, the model simultaneously infers a global scene representation that captures high-level scene structure and object-centric slot representations that embed individual object components. During generation, slot representations are sampled conditioned on the global scene representation, which ensures coherent scene structures. Our extensive evaluation of scene generation ability indicates that Slot-VAE outperforms slot representation-based generative baselines in terms of sample quality and scene structure accuracy.
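
The two-level generation order described above (global scene representation first, then slot representations conditioned on it) can be illustrated with a minimal sketch. This is not the Slot-VAE implementation: the function name `sample_slots`, the dimensions, the randomly initialized linear map standing in for a learned conditional prior, and the fixed noise scale are all assumptions made for illustration, and the decoder that renders slots into an image is omitted.

```python
import random


def sample_slots(num_slots=4, global_dim=8, slot_dim=6, seed=0):
    """Ancestral sampling: draw a global latent, then slots conditioned on it."""
    rng = random.Random(seed)

    # Global scene latent drawn from a standard normal prior.
    z_global = [rng.gauss(0.0, 1.0) for _ in range(global_dim)]

    slots = []
    for _ in range(num_slots):
        # Hypothetical stand-in for a learned conditional prior network:
        # a randomly initialized linear map from z_global to the slot mean.
        weights = [[rng.gauss(0.0, 0.5) for _ in range(global_dim)]
                   for _ in range(slot_dim)]
        mean = [sum(w * z for w, z in zip(row, z_global)) for row in weights]

        # Sample the slot latent around its conditional mean
        # (the 0.1 standard deviation is an arbitrary illustrative choice).
        slots.append([m + 0.1 * rng.gauss(0.0, 1.0) for m in mean])

    return z_global, slots


z_global, slots = sample_slots()
print(len(z_global), len(slots), len(slots[0]))  # prints: 8 4 6
```

Because every slot mean is computed from the same `z_global`, the sampled slots share scene-level information, which is the intuition behind generating slots from a global representation rather than independently.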