Current slot-oriented approaches for compositional scene segmentation from images and videos rely on provided background information or slot assignments. We present a segmented location and identity tracking system, Loci-Segmented (Loci-s), which requires neither. It learns to dynamically segment scenes into interpretable background and slot-based object encodings, separating RGB, mask, location, and depth information for each. The results show largely superior video decomposition performance on the MOVi datasets and on another established dataset collection targeting scene segmentation. The system's interpretable, compositional latent encodings may serve as a foundation model for downstream tasks.