Unsupervised video object-centric learning aims to decompose dynamic scenes into persistent, object-level representations. However, existing slot-based methods struggle to maintain stable object identity in challenging settings such as rapid motion and partial occlusion. First, they typically encode both the per-frame appearance of an object and its identity across frames in a single slot vector, creating an objective conflict that leads to slot swapping: reconstruction requires sensitivity to transient visual changes, whereas temporal consistency requires invariance to them. Second, the token renormalization used in Slot Attention can amplify weakly attending slots, allowing them to absorb tokens from other objects and destabilize slot-to-object correspondence. We propose Dual-State Slot Attention (DSSA), a fully self-supervised framework that addresses these limitations by separating appearance from identity and by reducing spurious updates from weakly matching slots. DSSA decomposes each slot into a local state for per-frame appearance and an identity state for temporally stable object information, thereby aligning reconstruction and temporal consistency with separate representations. The identity state is updated through a learned recurrent transition that acts as a temporal filter on the local state, while competition-modulated aggregation (CMA) down-weights updates from weakly matching slots and prevents them from absorbing tokens from other objects. Experiments on MOVi-C, MOVi-D, and YouTube-VIS demonstrate that DSSA consistently improves segmentation quality and temporal consistency over prior methods, while also yielding stronger downstream object recognition and video dynamics prediction. Code and models will be made publicly available upon acceptance.
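The renormalization issue described above can be illustrated with a small NumPy sketch. It is a simplified stand-in for Slot Attention (no learned projections, no GRU update, no epsilon term), and the gate used in `gated_update` is a hypothetical choice made only for illustration: it is not the CMA rule of DSSA, whose exact form is not given in this abstract. The function names and the toy inputs are likewise assumptions.

```python
import numpy as np

def slot_attention_weights(logits):
    """logits: (num_slots, num_tokens) slot-token similarity scores."""
    # Softmax over the slot axis: slots compete for each token.
    shifted = logits - logits.max(axis=0, keepdims=True)
    attn = np.exp(shifted)
    attn = attn / attn.sum(axis=0, keepdims=True)
    # Renormalize over the token axis: each slot's weights sum to 1,
    # regardless of how little attention the slot actually won.
    weights = attn / attn.sum(axis=1, keepdims=True)
    return attn, weights

def gated_update(slots, tokens, logits):
    """Blend each slot toward its aggregated tokens, scaled by a gate.

    ASSUMPTION: the gate is the slot's largest competition share over
    all tokens. This is an illustrative heuristic, not the paper's CMA.
    """
    attn, weights = slot_attention_weights(logits)
    candidate = weights @ tokens             # (num_slots, dim)
    gate = attn.max(axis=1, keepdims=True)   # values in (0, 1]
    return (1.0 - gate) * slots + gate * candidate, gate

# Toy case: slot 0 matches every token strongly, slot 1 only weakly.
logits = np.array([[4.0, 4.0, 4.0, 4.0],
                   [0.0, 0.0, 0.0, 0.0]])
tokens = np.ones((4, 3))
slots = np.zeros((2, 3))

attn, weights = slot_attention_weights(logits)
new_slots, gate = gated_update(slots, tokens, logits)
```

In this toy case slot 1 wins under 2% of every token, yet after renormalization its weights are uniform and sum to 1, so an ungated weighted mean would move it all the way to the token average. With the illustrative gate, the weak slot barely changes while the strongly matching slot is updated almost fully.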