SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces

Following the remarkable success of diffusion models in image generation, the research community has shown increasing interest in extending these models to video generation. Recent video diffusion models have predominantly used attention layers to extract temporal features. However, the memory consumption of attention layers grows quadratically with sequence length, which poses significant challenges when generating longer video sequences with diffusion models. To overcome this limitation, we propose leveraging state-space models (SSMs). SSMs have recently gained attention as viable alternatives because their memory consumption grows linearly with sequence length. In our experiments, we first evaluate our SSM-based model on UCF101, a standard benchmark for video generation. In addition, to investigate the potential of SSMs for longer video generation, we perform an experiment on the MineRL Navigate dataset with the number of frames set to 64 and 150. In these settings, our SSM-based model considerably reduces memory consumption for longer sequences while maintaining FVD scores competitive with those of attention-based models. Our code is available at https://github.com/shim0114/SSM-Meets-Video-Diffusion-Models.
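To make the memory argument concrete, the following is a minimal sketch and not the architecture from the paper. It assumes a naive temporal attention layer that materializes the full frames-by-frames score matrix, and it compares that with a toy diagonal linear SSM that keeps only a fixed-size state while scanning over frames. The function names `attention_score_elements` and `ssm_scan`, the state size, and the parameter values are all hypothetical choices made for illustration.

```python
import numpy as np


def attention_score_elements(num_frames: int) -> int:
    """Entries in one temporal attention score matrix (frames x frames).

    Assumes a naive implementation that stores the whole matrix.
    """
    return num_frames * num_frames


def ssm_scan(u: np.ndarray, a: np.ndarray, b: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Run a toy diagonal linear SSM over a 1-D input sequence u.

    Recurrence: x_k = a * x_{k-1} + b * u_k,   y_k = dot(c, x_k)
    Only the current state x (size N) is kept between steps, so the
    working memory of the scan does not depend on the sequence length.
    """
    x = np.zeros_like(a, dtype=float)
    y = np.empty(len(u), dtype=float)
    for k, u_k in enumerate(u):
        x = a * x + b * u_k
        y[k] = np.dot(c, x)
    return y


if __name__ == "__main__":
    state_size = 4  # hypothetical state dimension N
    a = np.full(state_size, 0.5)
    b = np.ones(state_size)
    c = np.ones(state_size) / state_size

    # Frame counts taken from the MineRL Navigate experiments in the abstract.
    for frames in (64, 150):
        y = ssm_scan(np.ones(frames), a, b, c)
        print(
            f"frames={frames}: attention score entries={attention_score_elements(frames)}, "
            f"SSM outputs={y.size}, SSM state entries={state_size}"
        )
```

Under these assumptions, going from 64 to 150 frames grows the attention score matrix from 4096 to 22500 entries, while the SSM output grows from 64 to 150 values and its state stays at a fixed size.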

