Video generation has recently made substantial progress and now produces realistic results. Nevertheless, existing AI-generated videos are usually very short clips ("shot-level") depicting a single scene. Delivering a coherent long video ("story-level") calls for creative transition and prediction effects across different clips. This paper presents SEINE, a short-to-long video diffusion model that focuses on generative transition and prediction. The goal is to generate high-quality long videos with smooth, creative transitions between scenes and shot-level videos of varying lengths. Specifically, we propose a random-mask video diffusion model that automatically generates transitions based on textual descriptions. Given images of different scenes as inputs, combined with text-based control, our model generates transition videos that ensure coherence and visual quality. Furthermore, the model can be readily extended to tasks such as image-to-video animation and autoregressive video prediction. To evaluate this new generative task comprehensively, we propose three assessment criteria for smooth and creative transitions: temporal consistency, semantic similarity, and video-text semantic alignment. Extensive experiments validate the effectiveness of our approach over existing methods for generative transition and prediction, enabling the creation of story-level long videos. Project page: https://vchitect.github.io/SEINE-project/
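The abstract does not spell out the masking scheme, so the following is only a minimal sketch of how a single frame-level mask could, in principle, cover the three tasks it names (transition, image-to-video animation, autoregressive prediction). All function names, the `context` and `keep_prob` parameters, and the independent per-frame sampling used for the random mask are assumptions made for illustration, not the authors' implementation.

```python
import random

def frame_mask(num_frames, visible):
    """1 marks a frame supplied as a condition; 0 marks a frame to be generated."""
    visible = set(visible)
    return [1 if i in visible else 0 for i in range(num_frames)]

def task_mask(task, num_frames, context=1):
    """Hypothetical helper: map each task named in the abstract to a mask."""
    if task == "transition":   # first and last frames come from two different scenes
        return frame_mask(num_frames, [0, num_frames - 1])
    if task == "animation":    # image-to-video: only the first frame is given
        return frame_mask(num_frames, [0])
    if task == "prediction":   # autoregressive: the first `context` frames are given
        return frame_mask(num_frames, range(context))
    raise ValueError(f"unknown task: {task}")

def random_mask(num_frames, keep_prob=0.5, rng=None):
    """Assumed training-time stand-in: keep each frame independently with keep_prob."""
    rng = rng or random.Random()
    return [1 if rng.random() < keep_prob else 0 for _ in range(num_frames)]

def apply_mask(frames, mask):
    """Hide the frames the model must generate (None stands in for a blanked frame)."""
    return [frame if keep else None for frame, keep in zip(frames, mask)]

print(task_mask("transition", 5))             # [1, 0, 0, 0, 1]
print(task_mask("prediction", 6, context=3))  # [1, 1, 1, 0, 0, 0]
print(apply_mask(["a", "b", "c", "d"], task_mask("animation", 4)))
```

Under these assumptions, a model trained on randomly sampled masks would see the transition, animation, and prediction masks as special cases at inference time, which is one plausible reading of why the abstract describes the extension to those tasks as straightforward.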