Recent compositional scene representation learning models have become remarkably good at segmenting and tracking distinct objects within visual scenes. Yet many of these models require objects to remain at least partially visible at all times. Moreover, they tend to fail intuitive physics tests that infants learn to solve within the first months of life. Our goal is to advance compositional scene representation algorithms with an embedded algorithm that fosters the progressive learning of intuitive physics, akin to infant development. As a fundamental component of such an algorithm, we introduce Loci-Looped, which extends a recently published unsupervised object location, identification, and tracking neural network architecture (Loci, Traub et al., ICLR 2023) with an internal processing loop. The loop is designed to adaptively blend pixel-space information with anticipations, yielding information-fused activities as percepts. Moreover, it is designed to learn compositional representations of both individual object dynamics and between-object interaction dynamics. We show that Loci-Looped learns to track objects through extended periods of occlusion, simulating their hidden trajectories and anticipating their reappearance without the need for an explicit history buffer. Loci-Looped also surpasses state-of-the-art models on the ADEPT and CLEVRER datasets when confronted with object occlusions or temporary interruptions of sensory data. This indicates that Loci-Looped is able to learn the physical concepts of object permanence and inertia in a fully unsupervised, emergent manner. We believe that further architectural advancements of the internal loop, including in other compositional scene representation learning models, can be developed in the near future.
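The idea of blending pixel-space information with anticipations can be illustrated with a toy sketch. This is not the Loci-Looped architecture: the function name `track_through_occlusion`, the one-dimensional position, the constant-velocity predictor, and the hand-set gate values are all assumptions made for illustration, whereas the abstract describes a loop that learns both the blending and the object dynamics.

```python
from typing import List


def track_through_occlusion(
    observations: List[float],
    gates: List[float],
    velocity: float,
) -> List[float]:
    """Toy 1-D tracker that fuses observations with its own predictions.

    observations[t] is the sensed position at step t (ignored when gate is 0).
    gates[t] in [0, 1] sets how much the sensed position is trusted:
    1.0 means fully visible, 0.0 means occluded (prediction only).
    velocity is a fixed, assumed per-step displacement (a stand-in for inertia).
    """
    if len(observations) != len(gates):
        raise ValueError("observations and gates must have the same length")
    if not observations:
        return []

    position = observations[0]  # assume the object is visible at the first step
    trajectory = [position]
    for observed, gate in zip(observations[1:], gates[1:]):
        if not 0.0 <= gate <= 1.0:
            raise ValueError("gate values must lie in [0, 1]")
        # Anticipation: extrapolate the previous fused percept.
        predicted = position + velocity
        # Percept: convex blend of sensory input and anticipation.
        position = gate * observed + (1.0 - gate) * predicted
        trajectory.append(position)
    return trajectory


if __name__ == "__main__":
    # Steps 2 and 3 are occluded: the sensed values there are placeholders.
    sensed = [0.0, 1.0, 0.0, 0.0, 4.0]
    gate_values = [1.0, 1.0, 0.0, 0.0, 1.0]
    print(track_through_occlusion(sensed, gate_values, velocity=1.0))
```

With the gate closed during the two occluded steps, the fused percept keeps following the predicted trajectory and agrees with the observation once the object reappears. No history buffer is involved; only the previous fused percept is carried forward.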