Inference-based GAN Video Generation

Video generation has seen remarkable progress thanks to advancements in generative deep learning. However, generating long sequences remains a significant challenge: generated videos should display not only coherent and continuous movement but also meaningful movement across successive scenes. Models such as GANs, VAEs, and diffusion networks have been used to generate short video sequences, typically of up to 16 frames. In this paper, we first propose a new type of video generator that equips adversarial unconditional video generators with a variational encoder, resulting in a structure akin to a VAE-GAN hybrid. Like other deep learning-based video processing frameworks, the proposed model incorporates two processing branches, one for content and another for movement. However, existing models struggle with the temporal scaling of the generated videos: classical approaches often degrade video quality when the generated video length is increased, especially for very long sequences. To overcome this limitation, we extend the initially proposed VAE-GAN video generation model with a novel, memory-efficient approach that generates long videos composed of hundreds or thousands of frames while ensuring their temporal continuity, consistency, and dynamics. Our approach leverages a Markov chain framework with a recall mechanism, where each state represents a short-length VAE-GAN video generator. This setup enables the sequential connection of generated video sub-sequences, maintaining temporal dependencies and resulting in meaningful long video sequences.
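
The chaining idea described above can be pictured with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the recall mechanism works, so the sketch assumes that the state recalled between clips is simply the final motion latent of the previous clip. The names `toy_clip_generator` and `generate_long_video` are hypothetical, and a toy random-walk decoder stands in for the trained short-clip VAE-GAN generator. Frames are yielded one at a time, so only the recalled state is kept in memory regardless of video length.

```python
import math
import random
from typing import Iterator, List, Tuple

Frame = List[float]   # toy stand-in for an image: a short vector of floats
Latent = List[float]


def toy_clip_generator(
    content: Latent, motion: Latent, clip_len: int, rng: random.Random
) -> Tuple[List[Frame], Latent]:
    """Hypothetical stand-in for one short-clip generator (one chain state).

    Assumption: a content code is held fixed while a motion code evolves,
    mirroring the two-branch (content / movement) design in the abstract.
    """
    frames: List[Frame] = []
    for _ in range(clip_len):
        # The motion latent takes a small random-walk step per frame.
        motion = [m + rng.gauss(0.0, 0.05) for m in motion]
        # "Decode" a frame from the fixed content and the current motion.
        frames.append([math.tanh(c + m) for c, m in zip(content, motion)])
    return frames, motion


def generate_long_video(
    num_clips: int, clip_len: int = 16, latent_dim: int = 4, seed: int = 0
) -> Iterator[Frame]:
    """Chain short clips; each clip depends only on the recalled state."""
    rng = random.Random(seed)
    content = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    motion = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    for _ in range(num_clips):
        # Assumed recall step: the last motion latent of the previous clip
        # seeds the next one, so clip boundaries stay continuous.
        frames, motion = toy_clip_generator(content, motion, clip_len, rng)
        yield from frames


if __name__ == "__main__":
    video = list(generate_long_video(num_clips=4, clip_len=16))
    print(len(video), "frames generated")
```

In this toy setting, the next clip depends only on the state handed over from the previous one, which is the Markov property; a real system would replace the random walk with learned generators and could recall richer information, such as the last few generated frames.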
