Video generation has advanced remarkably, yet the field still lacks a clear, systematic recipe to guide the development of robust and scalable models. In this work, we present a comprehensive study of the interplay among model architectures, training recipes, and data curation strategies, culminating in a simple and scalable text-image-conditioned video generation method named STIV. Our framework integrates the image condition into a Diffusion Transformer (DiT) through frame replacement and incorporates text conditioning via joint image-text conditional classifier-free guidance. This design enables a single STIV model to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks. STIV can also be easily extended to applications such as video prediction, frame interpolation, multi-view generation, and long video generation. In comprehensive ablation studies on T2I, T2V, and TI2V, STIV demonstrates strong performance despite its simple design. An 8.7B-parameter model at 512 resolution achieves 83.1 on VBench T2V, surpassing leading open- and closed-source models such as CogVideoX-5B, Pika, Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result of 90.1 on the VBench I2V task at 512 resolution. By providing a transparent and extensible recipe for building cutting-edge video generation models, we aim to empower future research and accelerate progress toward more versatile and reliable video generation solutions.
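The two conditioning mechanisms named above, frame replacement and joint image-text classifier-free guidance, can be illustrated with a toy sketch. Everything in it is an assumption made for illustration rather than the STIV implementation: latent frames are plain Python lists, `toy_model` is a hypothetical stand-in for the DiT denoiser, the image condition is assumed to replace the first latent frame, and the unconditional branch is assumed to drop the text and image conditions together.

```python
from typing import Callable, List, Optional

Frame = List[float]   # one flattened latent frame (toy stand-in)
Video = List[Frame]   # a sequence of latent frames


def replace_first_frame(noisy: Video, image_latent: Optional[Frame]) -> Video:
    """Frame replacement (assumed form): swap the first noised frame for the
    clean image-condition latent. With no image condition, copy the input."""
    frames = [list(f) for f in noisy]
    if image_latent is not None:
        frames[0] = list(image_latent)
    return frames


def toy_model(latents: Video, text: Optional[str]) -> Video:
    """Hypothetical denoiser: adds 1.0 to every value when text is present."""
    bias = 0.0 if text is None else 1.0
    return [[v + bias for v in frame] for frame in latents]


def joint_cfg(
    model: Callable[[Video, Optional[str]], Video],
    noisy: Video,
    text: Optional[str],
    image_latent: Optional[Frame],
    scale: float,
) -> Video:
    """Joint guidance (assumed form): one fully conditioned pass and one pass
    with both conditions dropped, combined as uncond + scale * (cond - uncond)."""
    cond = model(replace_first_frame(noisy, image_latent), text)
    uncond = model(replace_first_frame(noisy, None), None)
    return [
        [u + scale * (c - u) for c, u in zip(c_frame, u_frame)]
        for c_frame, u_frame in zip(cond, uncond)
    ]


noisy_latents = [[0.5, 0.5], [0.25, 0.25]]
first_frame = [1.0, 1.0]
guided = joint_cfg(toy_model, noisy_latents, "a cat", first_frame, scale=2.0)
```

Under these assumptions, passing `image_latent=None` reduces the same function to ordinary text-only guidance, which is one way to see how a single model could serve both T2V and TI2V.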