Selective Synergistic Learning for Video Object-Centric Learning

Typical video object-centric learning (VOCL) approaches employ slot-based frameworks that rely on reconstruction-driven encoder-decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. Because these two maps exhibit different properties, a recent dense alignment strategy attempted to reconcile the discrepancy by enforcing agreement across all spatio-temporal patches via contrastive learning. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module, such as noisy encoder predictions and blurred decoder boundaries. Moreover, computing dense similarities across all patch pairs incurs a computational cost quadratic in the total number of spatio-temporal patches, severely limiting scalability. To address both issues, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: it leverages the encoder strictly for boundary refinement and the decoder for interior denoising. This is realized via a pseudo-labeling scheme with linear complexity, eliminating the need for quadratic spatial comparisons. In addition, to avoid reinforcing architectural biases such as slot redundancy, we introduce a transitive pseudo-label merging step that consolidates overlapping slots based on spatio-temporal activation consistency. Extensive experiments demonstrate that SSync improves decomposition quality and serves as a versatile, plug-and-play module while also exhibiting exceptional robustness to slot configurations. Code is available at github.com/wjun0830/SSync.
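
To make the idea of transitive merging concrete, the following is a minimal sketch and not the released SSync implementation. It assumes each slot is summarized by the set of (frame, patch) indices where it is active across the clip, and it uses intersection-over-union with a fixed threshold as a stand-in for the "spatio-temporal activation consistency" criterion, which the abstract does not specify. The names `iou`, `merge_slots_transitively`, and `threshold` are hypothetical. Slots are grouped with a union-find structure, so if slot A overlaps B and B overlaps C, all three share one label even when A and C overlap only weakly.

```python
from itertools import combinations


def iou(a, b):
    """Intersection-over-union of two sets of (frame, patch) indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def merge_slots_transitively(slot_patches, threshold=0.5):
    """Group overlapping slots, closing the overlap relation transitively.

    slot_patches: list of sets; slot_patches[k] holds the (frame, patch)
    indices where slot k is active over the whole clip (an assumed format).
    Returns a list mapping each slot index to the representative of its
    merged group (the smallest slot index in that group).
    """
    parent = list(range(len(slot_patches)))

    def find(k):
        # Follow parent links to the root, shortening the path on the way.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for i, j in combinations(range(len(slot_patches)), 2):
        # Assumed criterion: two slots are redundant if their IoU is high.
        if iou(slot_patches[i], slot_patches[j]) >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                # Keep the smaller index as root so labels stay stable.
                parent[max(ri, rj)] = min(ri, rj)

    return [find(k) for k in range(len(slot_patches))]


# Toy clip: slots 0-1 and 1-2 overlap strongly (IoU 0.6), slots 0-2 only
# weakly (IoU 1/3), and slot 3 is disjoint from the rest.
slots = [
    {(0, 0), (0, 1), (0, 2), (1, 0)},
    {(0, 1), (0, 2), (1, 0), (1, 1)},
    {(0, 2), (1, 0), (1, 1), (1, 2)},
    {(0, 5), (1, 5)},
]
print(merge_slots_transitively(slots, threshold=0.5))  # [0, 0, 0, 3]
```

In this sketch the pairwise loop runs over slots rather than patches, so its cost grows with the square of the slot count (typically small) and only linearly with the number of spatio-temporal patches, which is in the spirit of the linear-complexity claim above.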
