STIV: Scalable Text and Image Conditioned Video Generation

Zongyu Lin, Wei Liu, Chen Chen, Jiasen Lu, Wenze Hu, Tsu-Jui Fu, Jesse Allardice, Zhengfeng Lai, Liangchen Song, Bowen Zhang, Cha Chen, Yiran Fei, Lezhi Li, Yizhou Sun, Kai-Wei Chang, Yinfei Yang

The field of video generation has made remarkable advancements, yet there remains a pressing need for a clear, systematic recipe that can guide the development of robust and scalable models. In this work, we present a comprehensive study that systematically explores the interplay of model architectures, training recipes, and data curation strategies, culminating in a simple and scalable text-image-conditioned video generation method named STIV. Our framework integrates the image condition into a Diffusion Transformer (DiT) through frame replacement, and incorporates text conditioning via joint image-text conditional classifier-free guidance. This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks with a single model. Additionally, STIV can be easily extended to applications such as video prediction, frame interpolation, multi-view generation, and long video generation. In comprehensive ablation studies on T2I, T2V, and TI2V, STIV demonstrates strong performance despite its simple design. An 8.7B model at 512 resolution achieves 83.1 on VBench T2V, surpassing leading open- and closed-source models such as CogVideoX-5B, Pika, Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result of 90.1 on the VBench I2V task at 512 resolution. By providing a transparent and extensible recipe for building cutting-edge video generation models, we aim to empower future research and accelerate progress toward more versatile and reliable video generation solutions.
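The two conditioning mechanisms named in the abstract can be illustrated with a toy sketch. The abstract does not give the formulas, so the following is an assumption about how they might look: frame replacement swaps the first noisy latent frame for the clean image latent, and the joint guidance step extrapolates from a fully unconditional prediction (no text, no image) toward the jointly conditioned one. The names `replace_first_frame`, `joint_cfg`, and the `model` callable are hypothetical, and plain Python lists stand in for latent tensors.

```python
from typing import Callable, List, Optional

Frame = List[float]   # one flattened latent frame (toy stand-in for a tensor)
Video = List[Frame]   # a sequence of latent frames

# Hypothetical denoiser signature: (latent video, optional text) -> prediction.
Model = Callable[[Video, Optional[str]], Video]


def replace_first_frame(noisy: Video, image: Optional[Frame]) -> Video:
    """Assumed form of frame replacement: the first noisy frame is swapped
    for the clean image latent. With image=None the input is copied
    unchanged, i.e. the image condition is dropped (plain T2V)."""
    if image is None:
        return [list(f) for f in noisy]
    return [list(image)] + [list(f) for f in noisy[1:]]


def joint_cfg(model: Model, noisy: Video, text: Optional[str],
              image: Optional[Frame], scale: float) -> Video:
    """Assumed joint image-text guidance: one pass with both conditions,
    one pass with neither, then uncond + scale * (cond - uncond)."""
    cond = model(replace_first_frame(noisy, image), text)
    uncond = model(replace_first_frame(noisy, None), None)
    return [
        [u + scale * (c - u) for c, u in zip(cond_f, uncond_f)]
        for cond_f, uncond_f in zip(cond, uncond)
    ]


def toy_model(video: Video, text: Optional[str]) -> Video:
    """Stand-in for a DiT: adds 1.0 to every value when text is present."""
    shift = 1.0 if text else 0.0
    return [[v + shift for v in frame] for frame in video]
```

In this sketch, passing `image=None` reduces the call to ordinary text-only classifier-free guidance, which mirrors the abstract's claim that one model covers both T2V and TI2V; a `scale` of 1.0 returns the conditioned prediction unchanged.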

