We propose a new object-centric video prediction algorithm based on the deep latent particle (DLP) representation. Compared with existing slot- or patch-based representations, DLPs model the scene using a set of keypoints with learned parameters for properties such as position and size, and are both efficient and interpretable. Our method, deep dynamic latent particles (DDLP), yields state-of-the-art object-centric video prediction results on several challenging datasets. The interpretable nature of DDLP allows us to perform ``what-if'' generation, that is, to predict the consequences of changing properties of objects in the initial frames. In addition, DLP's compact structure enables efficient diffusion-based unconditional video generation. Videos, code and pre-trained models are available at https://taldatech.github.io/ddlp-web