In this paper, we study video synthesis with an emphasis on simplifying the generation conditions. Most existing video synthesis models and datasets are designed to address the complex motion of a single object, and lack the ability to comprehensively understand the spatio-temporal relationships among multiple objects. Moreover, current methods are usually conditioned on intricate annotations (e.g., video segmentations) to generate new videos, which makes them less practical. These limitations motivate us to generate multi-object videos conditioned exclusively on object layouts from a single frame. To address these challenges, and inspired by recent research on image generation from layouts, we propose a novel video generative framework capable of synthesizing global scenes with local objects via implicit neural representations and layout motion self-inference. Our framework is a non-trivial adaptation of image generation methods and is new to this field. In addition, we evaluate our model on two widely used video recognition benchmarks, demonstrating its effectiveness compared to the baseline model.