Slot-oriented processing approaches for compositional scene representation have recently undergone tremendous development. We present Loci-Segmented (Loci-s), an advanced scene segmentation neural network that extends the slot-based location and identity tracking architecture Loci (Traub et al., ICLR 2023). The main advancements are (i) the addition of a pre-trained dynamic background module; (ii) a hyper-convolution encoder module, which enables object-focused bottom-up processing; and (iii) a cascaded decoder module, which successively generates object masks, masked depth maps, and masked, depth-map-informed RGB reconstructions. The background module learns both a foreground-identifying module and a background re-generator. We further improve performance via (a) the integration of depth information, and we improve slot assignments via (b) slot-location-entity regularization and (c) a prior segmentation network. Even without these latter improvements, the results reveal superior segmentation performance on the MOVi datasets and on another established dataset collection. With all improvements, Loci-s achieves a 32% better intersection over union (IoU) score on MOVi-E than the previous best. We furthermore show that Loci-s generates well-interpretable latent representations. We believe that these representations may serve as a foundation-model-like interpretable basis for solving downstream tasks, such as grounding language and context- and goal-conditioned event processing.
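For readers unfamiliar with the reported metric, the following is a minimal sketch of intersection over union (IoU) for a single pair of binary masks. It is not the evaluation code used for Loci-s: the function name `mask_iou`, the use of NumPy, and the convention that two empty masks score 1.0 are all assumptions made for illustration, and the matching of predicted slots to ground-truth objects is not covered.

```python
import numpy as np


def mask_iou(pred, target):
    """IoU of two binary masks of the same shape (hypothetical helper)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)

    # Pixels covered by at least one of the two masks.
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0

    # Pixels covered by both masks.
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)


if __name__ == "__main__":
    predicted = [[1, 1], [0, 0]]
    ground_truth = [[1, 0], [0, 0]]
    print(mask_iou(predicted, ground_truth))
```

In the usage example, the masks share one pixel and jointly cover two, so the score is one half.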