When perceiving the world from multiple viewpoints, humans can reason compositionally about complete objects, even when an object is completely occluded from certain viewpoints. Humans can also imagine novel views after observing multiple viewpoints. Despite remarkable recent advances, multi-view object-centric learning still leaves two problems unresolved: 1) the shapes of partially or completely occluded objects cannot be reconstructed well; 2) novel-viewpoint prediction depends on expensive viewpoint annotations rather than on implicit rules in the view representations. In this paper, we introduce a time-conditioned generative model for videos. To reconstruct the complete shape of an object accurately, we enhance the disentanglement between the latent representations of objects and views: the latent representations of time-conditioned views are jointly inferred with a Transformer and are then fed to a sequential extension of Slot Attention to learn object-centric representations. In addition, Gaussian processes are employed as priors over the view latent variables, enabling video generation and novel-view prediction without viewpoint annotations. Experiments on multiple datasets demonstrate that the proposed model can perform object-centric video decomposition, reconstruct the complete shapes of occluded objects, and predict novel views.
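To make the role of the Gaussian process prior more concrete, the following is a minimal sketch of how view latent variables indexed by time could be sampled from a GP prior and then predicted at unobserved timestamps. It is an illustration under stated assumptions, not the model described in this paper: the squared-exponential kernel, its hyperparameters, the latent dimension, and the function names `rbf_kernel`, `sample_view_latents`, and `predict_view_latents` are all hypothetical choices made for this example.

```python
import numpy as np


def rbf_kernel(t1, t2, length_scale=0.3, variance=1.0):
    """Squared-exponential kernel between two 1-D arrays of timestamps.

    The kernel family and hyperparameters are assumptions for illustration.
    """
    diff = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)


def sample_view_latents(times, latent_dim, rng, jitter=1e-6):
    """Draw one independent GP sample per latent dimension.

    Returns an array of shape (len(times), latent_dim).
    """
    K = rbf_kernel(times, times) + jitter * np.eye(len(times))
    L = np.linalg.cholesky(K)  # K = L @ L.T
    eps = rng.standard_normal((len(times), latent_dim))
    return L @ eps


def predict_view_latents(obs_times, obs_latents, query_times, jitter=1e-6):
    """GP posterior mean of the view latents at unobserved timestamps."""
    K_oo = rbf_kernel(obs_times, obs_times) + jitter * np.eye(len(obs_times))
    K_qo = rbf_kernel(query_times, obs_times)
    return K_qo @ np.linalg.solve(K_oo, obs_latents)


rng = np.random.default_rng(0)
obs_times = np.linspace(0.0, 1.0, 8)       # timestamps of observed frames
obs_latents = sample_view_latents(obs_times, latent_dim=4, rng=rng)
query_times = np.array([0.25, 0.5])        # timestamps with no observed frame
pred = predict_view_latents(obs_times, obs_latents, query_times)
print(pred.shape)  # (2, 4)
```

Because the prior ties view latents together only through their timestamps, a latent for a new time can be obtained by conditioning on the latents of observed frames, which is the sense in which no viewpoint annotation is needed in this sketch. In a full model the predicted latents would then be passed to a decoder, which is omitted here.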