TSA: Temporal Slot Activation for Persistent Object-Centric Video Representation

Unsupervised video object-centric learning aims to decompose dynamic scenes into temporally persistent entity representations. Existing recurrent video slot-attention methods propagate a fixed set of slots across frames, but they typically assume unconditional slot propagation: every slot is updated and decoded at every frame, regardless of whether its corresponding object is visible. We show that this design violates a basic lifecycle requirement for persistent slots: when an object is absent or fully occluded, its slot should preserve its previous state and should not explain unrelated visible content. Unconditional propagation instead creates two failure pathways: update-induced state drift, where current-frame evidence overwrites the absent object's representation, and decoder-induced reconstruction interference, where the inactive slot remains coupled to reconstruction through decoder attention. We propose Temporal Slot Activation (TSA), a mechanism that learns a per-slot, per-frame activation score $\alpha_{k,t} \in (0, 1)$ without visibility supervision. TSA uses this activation as a shared latent control variable for slot lifecycle modeling. When a slot is inactive, TSA anchors its state to its previous-frame value via activation-gated updating and suppresses its decoder participation through an activation-dependent additive bias on the attention logits, applied before softmax normalization. Together, these two mechanisms reduce state drift and reconstruction-driven interference. To improve activation decisions under partial occlusion and gradual reappearance, TSA further conditions activation prediction on a per-slot temporal memory produced by a Temporal Context Encoder. We evaluate TSA on the MOVi-C/E, YT-VIS, and OVIS benchmarks using both standard and tracking-based metrics (FG-ARI, mBO, IDF1, HOTA). TSA consistently improves object decomposition and temporal identity preservation, with large gains on long, heavily occluded videos.
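To make the two roles of the activation score concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the gated update is a convex blend between the previous slot state and a candidate update, and that the decoder bias takes the form $\log(\alpha_{k,t} + \epsilon)$; the abstract specifies neither form. The function names `gated_slot_update` and `biased_decoder_attention` are hypothetical, and the activations are passed in directly, whereas TSA predicts them from a per-slot temporal memory.

```python
import math

def gated_slot_update(prev_slots, cand_slots, alpha):
    """Blend each slot between its previous state and a candidate update.

    ASSUMPTION: a convex blend. alpha near 0 keeps the previous state
    (inactive slot); alpha near 1 accepts the candidate (active slot).
    """
    return [
        [a * c + (1.0 - a) * p for p, c in zip(prev, cand)]
        for prev, cand, a in zip(prev_slots, cand_slots, alpha)
    ]

def biased_decoder_attention(logits, alpha, eps=1e-6):
    """Softmax over slots after adding an activation-dependent bias.

    ASSUMPTION: the bias is log(alpha + eps), so an inactive slot gets a
    large negative logit and almost no attention weight.
    logits: one row of per-slot logits for each decoder query.
    """
    bias = [math.log(a + eps) for a in alpha]
    weights = []
    for row in logits:
        biased = [value + b for value, b in zip(row, bias)]
        peak = max(biased)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in biased]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Two slots: slot 0 is fully active, slot 1 is fully inactive.
prev = [[1.0, 2.0], [3.0, 4.0]]
cand = [[10.0, 20.0], [30.0, 40.0]]
alpha = [1.0, 0.0]

new_slots = gated_slot_update(prev, cand, alpha)    # [[10.0, 20.0], [3.0, 4.0]]
attn = biased_decoder_attention([[0.0, 0.0]], alpha)
```

In this sketch the inactive slot keeps its previous state exactly and receives a negligible share of decoder attention even though its raw logit equals that of the active slot, which illustrates how a single activation value can control both the update and the decoding pathway.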
