The extraction of modular object-centric representations for downstream tasks is an emerging area of research. Learning grounded representations of objects that are guaranteed to be stable and invariant promises robust performance across different tasks and environments. Slot Attention (SA) learns object-centric representations by assigning objects to \textit{slots}, but presupposes a \textit{single} distribution from which all slots are randomly initialised. This results in an inability to learn \textit{specialized} slots which bind to specific object types and remain invariant to identity-preserving changes in object appearance. To address this, we present \emph{\textsc{Co}nditional \textsc{S}lot \textsc{A}ttention} (\textsc{CoSA}) using a novel concept of \emph{Grounded Slot Dictionary} (GSD) inspired by vector quantization. Our proposed GSD comprises (i) canonical object-level property vectors and (ii) parametric Gaussian distributions, which define a prior over the slots. We demonstrate the benefits of our method in multiple downstream tasks such as scene generation, composition, and task adaptation, whilst remaining competitive with SA in popular object discovery benchmarks.
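To make the contrast between the two initialisation schemes concrete, the following is a minimal sketch, not the implementation described in this work. It assumes that each dictionary entry pairs a canonical property vector with its own diagonal Gaussian, and that a per-object query vector selects the nearest entry by squared Euclidean distance, as in vector quantization. How queries are actually obtained, and how the dictionary is learned, is not specified in the abstract; all names here (`nearest_code`, `init_slots_shared`, `init_slots_conditional`, `codebook`, `queries`) are hypothetical.

```python
import random


def nearest_code(query, codebook):
    """Return the index of the codebook vector closest to `query`
    under squared Euclidean distance (a vector-quantization lookup)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(query, codebook[i]))


def sample_slot(mean, std, rng):
    """Draw one slot vector from a diagonal Gaussian."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def init_slots_shared(num_slots, mean, std, rng):
    """Slot Attention-style initialisation: every slot is drawn
    from the same single Gaussian."""
    return [sample_slot(mean, std, rng) for _ in range(num_slots)]


def init_slots_conditional(queries, codebook, means, stds, rng):
    """Assumed dictionary-conditioned initialisation: each query selects
    its nearest canonical vector, and the slot is drawn from the Gaussian
    paired with that dictionary entry."""
    indices = [nearest_code(q, codebook) for q in queries]
    slots = [sample_slot(means[i], stds[i], rng) for i in indices]
    return indices, slots


# Toy dictionary with three entries in a 2-D space (illustrative values only).
rng = random.Random(0)
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]   # canonical property vectors
means = [[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]]      # per-entry Gaussian means
stds = [[0.1, 0.1], [0.1, 0.1], [0.1, 0.1]]        # per-entry Gaussian std devs

queries = [[0.9, 1.2], [-0.8, 0.7]]                # hypothetical per-object queries
indices, slots = init_slots_conditional(queries, codebook, means, stds, rng)
shared = init_slots_shared(2, [0.0, 0.0], [1.0, 1.0], rng)
```

In this toy setting the two queries select dictionary entries 1 and 2, so their slots are sampled around different means, whereas the shared scheme draws both slots from one distribution regardless of the input.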