For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new method of self-attention computation, termed Consistent Self-Attention, that significantly boosts the consistency between generated images and augments prevalent pretrained diffusion-based text-to-image models in a zero-shot manner. To extend our method to long-range video generation, we further introduce a novel semantic-space temporal motion prediction module, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in the semantic space. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects, and is significantly more stable than modules operating only in latent space, especially for long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of content. StoryDiffusion represents a pioneering exploration of visual story generation through both images and videos, which we hope will inspire further research on architectural modifications. Our code is made publicly available at https://github.com/HVision-NKU/StoryDiffusion.
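To make the first component concrete, below is a minimal, single-head sketch of the batch-wide token-sharing idea behind Consistent Self-Attention, in which each image's self-attention also attends to tokens sampled from the other images generated for the same story. The function and parameter names (`to_q`, `to_k`, `to_v`, `sample_ratio`) are illustrative placeholders and not the paper's actual implementation; see the repository above for the authoritative code.

```python
import torch
import torch.nn.functional as F

def consistent_self_attention_sketch(hidden_states, to_q, to_k, to_v, sample_ratio=0.5):
    """Sketch of training-free, batch-wide token sharing in self-attention.

    hidden_states: (B, N, C) features of B images generated jointly for one story.
    to_q / to_k / to_v: the attention layer's existing linear projections.
    sample_ratio: fraction of each image's tokens shared with the other images (assumed).
    """
    B, N, C = hidden_states.shape

    # Randomly sample a subset of tokens from every image in the batch.
    num_sampled = int(N * sample_ratio)
    idx = torch.randint(0, N, (B, num_sampled), device=hidden_states.device)
    sampled = torch.gather(hidden_states, 1, idx.unsqueeze(-1).expand(-1, -1, C))

    # Share the sampled tokens across the batch so each image can attend to the others.
    shared = sampled.reshape(1, B * num_sampled, C).expand(B, -1, -1)
    kv_input = torch.cat([hidden_states, shared], dim=1)  # (B, N + B*num_sampled, C)

    # Queries come from the image itself; keys/values include the shared tokens.
    q = to_q(hidden_states)
    k = to_k(kv_input)
    v = to_v(kv_input)

    # Standard scaled dot-product attention over the augmented key/value set.
    return F.scaled_dot_product_attention(q, k, v)
```

Because the augmentation only changes what the existing self-attention layers attend to, it can be dropped into a pretrained text-to-image diffusion model without any extra training, which is what makes the approach zero-shot.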
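For the second component, the sketch below illustrates one way a semantic-space motion predictor could be structured: the two given images are encoded into semantic tokens (e.g., by a frozen image encoder), and a transformer predicts per-frame embeddings between the two endpoints, which would then condition a video diffusion decoder. The class, layer sizes, and query design are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SemanticMotionPredictorSketch(nn.Module):
    """Hypothetical sketch: predict intermediate-frame embeddings in a semantic space."""

    def __init__(self, embed_dim=768, num_frames=16, num_layers=4):
        super().__init__()
        # One learned query per intermediate frame to be predicted (assumed design).
        self.frame_queries = nn.Parameter(torch.randn(num_frames, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)

    def forward(self, start_emb, end_emb):
        # start_emb / end_emb: (B, L, D) semantic tokens of the two provided images,
        # e.g. produced by a frozen image encoder such as CLIP (assumption).
        B, L, _ = start_emb.shape
        queries = self.frame_queries.unsqueeze(0).expand(B, -1, -1)

        # Concatenate endpoint tokens with the per-frame queries and let the
        # transformer fill in the motion between the two endpoints.
        tokens = torch.cat([start_emb, queries, end_emb], dim=1)
        out = self.transformer(tokens)

        # Return only the predicted per-frame embeddings; these would serve as
        # conditions for a video diffusion decoder that renders the frames.
        n = self.frame_queries.shape[0]
        return out[:, L:L + n]
```

Predicting motion in a semantic space rather than directly in the latent space is what, per the abstract, yields the smoother transitions and greater stability for long video generation.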