When perceiving the world from multiple viewpoints, humans can reason about complete objects in a compositional manner, even when an object is completely occluded from certain viewpoints. Humans can also imagine novel views after observing multiple viewpoints. Despite remarkable recent advances, multi-view object-centric learning still leaves two problems unresolved: 1) the shapes of partially or completely occluded objects cannot be reconstructed well; 2) novel-viewpoint prediction depends on expensive viewpoint annotations rather than on implicit rules in the view representations. In this paper, we introduce a time-conditioned generative model for videos. To reconstruct the complete shape of an object accurately, we enhance the disentanglement between the latent representations of objects and views: the latent representations of time-conditioned views are jointly inferred with a Transformer and then fed to a sequential extension of Slot Attention to learn object-centric representations. In addition, Gaussian processes are employed as priors on the view latent variables, enabling video generation and novel-view prediction without viewpoint annotations. Experiments on multiple datasets demonstrate that the proposed model can perform object-centric video decomposition, reconstruct the complete shapes of occluded objects, and make novel-view predictions.
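
To illustrate how a Gaussian process prior over time-conditioned view latent variables can support prediction at unseen timestamps without viewpoint annotations, the following is a minimal NumPy sketch. It is not the model described above: the squared-exponential kernel, the function names (`rbf_kernel`, `sample_view_latents`, `predict_view_latents`), the latent dimension, and all hyperparameters are assumptions chosen for illustration, and the Transformer and Slot Attention components are omitted entirely.

```python
import numpy as np

def rbf_kernel(t1, t2, length_scale=0.3, variance=1.0):
    """Squared-exponential kernel between two 1-D arrays of timestamps.

    The kernel choice and hyperparameters are assumptions for this sketch.
    """
    diff = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)

def sample_view_latents(times, latent_dim, rng, jitter=1e-6):
    """Draw one independent GP sample per latent dimension.

    Returns an array of shape (len(times), latent_dim).
    """
    K = rbf_kernel(times, times) + jitter * np.eye(len(times))
    L = np.linalg.cholesky(K)  # K = L @ L.T
    return L @ rng.standard_normal((len(times), latent_dim))

def predict_view_latents(obs_times, obs_latents, new_times, jitter=1e-6):
    """GP posterior mean of the view latents at new timestamps."""
    K_oo = rbf_kernel(obs_times, obs_times) + jitter * np.eye(len(obs_times))
    K_no = rbf_kernel(new_times, obs_times)
    return K_no @ np.linalg.solve(K_oo, obs_latents)

# Hypothetical setup: 8 observed frames, 4-dimensional view latents.
rng = np.random.default_rng(0)
obs_times = np.linspace(0.0, 1.0, 8)
obs_latents = sample_view_latents(obs_times, latent_dim=4, rng=rng)

# Query two timestamps that were never observed.
new_times = np.array([0.25, 0.5])
new_latents = predict_view_latents(obs_times, obs_latents, new_times)
print(new_latents.shape)
```

In this sketch, smoothness over time is the only "implicit rule" linking views: nearby timestamps receive correlated latents, so a latent for a novel timestamp can be obtained by conditioning on the observed ones. In a full model, such a predicted view latent would be passed to a decoder together with the object-centric representations.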