Given the remarkable achievements in image generation through diffusion models, the research community has shown increasing interest in extending these models to video generation. Recent diffusion models for video generation have predominantly used attention layers to extract temporal features. However, the memory consumption of attention layers grows quadratically with sequence length, which poses a significant challenge when generating longer video sequences with diffusion models. To overcome this challenge, we propose leveraging state-space models (SSMs), which have recently gained attention as viable alternatives because their memory consumption is linear in sequence length. In our experiments, we first evaluate our SSM-based model on UCF101, a standard benchmark for video generation. In addition, to investigate the potential of SSMs for longer video generation, we perform an experiment on the MineRL Navigate dataset, setting the number of frames to 64 and 150. In these settings, our SSM-based model substantially reduces memory consumption for longer sequences while maintaining FVD scores competitive with those of attention-based models. Our code is available at https://github.com/shim0114/SSM-Meets-Video-Diffusion-Models.
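
To make the quadratic-versus-linear contrast concrete, the following minimal sketch counts the intermediate entries each temporal layer would hold and runs a toy diagonal linear SSM recurrence. It is an illustration under stated assumptions, not the architecture used in this work: the function names, the state size of 16, and the diagonal recurrence are all hypothetical choices; only the frame counts 64 and 150 come from the experiments described above.

```python
def attention_score_entries(num_frames: int) -> int:
    """Entries in a T x T temporal attention score matrix
    (per head and per spatial location)."""
    return num_frames * num_frames


def ssm_state_entries(num_frames: int, state_dim: int) -> int:
    """Hidden-state entries when a linear SSM is unrolled over T frames.
    state_dim is an assumed, illustrative state size."""
    return num_frames * state_dim


def diagonal_ssm_scan(inputs, a, b, c):
    """Toy diagonal SSM over a 1-D sequence (hypothetical parameterization):
        h_t = a * h_{t-1} + b * x_t
        y_t = sum(c * h_t)
    a, b, c are lists of equal length N (the state size)."""
    h = [0.0] * len(a)
    outputs = []
    for x in inputs:
        # Elementwise state update; cost per step is O(N), independent of T.
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return outputs


if __name__ == "__main__":
    for frames in (64, 150):
        print(frames,
              attention_score_entries(frames),
              ssm_state_entries(frames, state_dim=16))
```

Under these assumptions, going from 64 to 150 frames multiplies the attention score entries by roughly 5.5 (4096 to 22500), while the SSM state entries grow by roughly 2.3, in proportion to the frame count.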