Slot-oriented approaches to compositional scene segmentation from images and videos still depend on provided background information or slot assignments. We present Loci-Segmented (Loci-s), which builds on the slot-based location and identity tracking architecture Loci (Traub et al., ICLR 2023). Loci-s enables dynamic (i) background processing by means of a foreground-identifying module and a background re-generator; (ii) top-down modified, object-focused bottom-up processing; and (iii) depth estimate generation. We also improve automatic slot assignment via a slot-location-entity regularization mechanism and a prior segmentation network. The results reveal superior video decomposition performance on the MOVi datasets and on another established dataset collection targeting scene segmentation. On the multi-object video dataset MOVi-E, Loci-s outperforms the state of the art in intersection over union (IoU) score by a large margin, even without supervised slot assignments and without being provided background information. We furthermore show that Loci-s generates well-interpretable latent representations. These representations may serve as a foundation-model-like interpretable basis for solving downstream tasks, such as grounding language, forming compositional rules, or solving one-shot reinforcement learning tasks.
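For readers unfamiliar with the metric, the intersection over union (IoU) of a predicted and a ground-truth segmentation mask is the number of pixels both masks mark as object divided by the number of pixels either mask marks. The following is a minimal sketch of that generic definition only; the function name `mask_iou`, the nested-list mask format, and the convention of returning 1.0 for two empty masks are illustrative assumptions, and the evaluation protocol actually used for Loci-s (for example, how predicted slots are matched to objects) is not specified in this abstract.

```python
def mask_iou(pred, target):
    """IoU of two binary masks given as equally shaped nested lists.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    inter = 0  # pixels set in both masks
    union = 0  # pixels set in at least one mask
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += bool(p) and bool(t)
            union += bool(p) or bool(t)
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


# Two 2x2 squares that overlap in one column of a 2x3 image.
pred = [[1, 1, 0],
        [1, 1, 0]]
target = [[0, 1, 1],
          [0, 1, 1]]
iou = mask_iou(pred, target)  # 2 shared pixels out of 6 covered pixels
```

In this toy case the masks share 2 pixels and together cover 6, so the score is 2/6. An IoU of 1 means the masks coincide exactly, and 0 means they do not overlap at all.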