This paper investigates a solution for enabling the in-context generation capabilities of video diffusion transformers, requiring only minimal tuning for activation. Specifically, we propose a simple pipeline to leverage in-context generation: ($\textbf{i}$) concatenate videos along the spatial or temporal dimension, ($\textbf{ii}$) jointly caption multi-scene video clips from a single source, and ($\textbf{iii}$) apply task-specific fine-tuning on carefully curated small datasets. Through a series of diverse controllable tasks, we demonstrate qualitatively that existing advanced text-to-video models can effectively perform in-context generation. Notably, our approach enables the creation of consistent multi-scene videos exceeding 30 seconds in duration without additional computational overhead. Importantly, the method requires no modifications to the original models and yields high-fidelity video outputs that better align with prompt specifications and maintain role consistency. Our framework presents a valuable tool for the research community and offers critical insights for advancing product-level controllable video generation systems. The data, code, and model weights are publicly available at: \url{https://github.com/feizc/Video-In-Context}.
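As a concrete illustration of step ($\textbf{i}$), the minimal sketch below shows temporal and spatial concatenation on frame tensors. It assumes clips are stored as $(T, C, H, W)$ tensors; the helper name \texttt{concat\_clips} and the tensor layout are illustrative assumptions rather than part of the released code, and the released pipeline may apply the same operation to latents instead of raw frames.

```python
import torch

def concat_clips(clip_a: torch.Tensor, clip_b: torch.Tensor,
                 mode: str = "time") -> torch.Tensor:
    """Join two clips of shape (T, C, H, W) into one in-context sample.

    Hypothetical helper for illustration only; not from the released repo.
    """
    if mode == "time":
        # Stack frames back to back: (T_a + T_b, C, H, W)
        return torch.cat([clip_a, clip_b], dim=0)
    elif mode == "space":
        # Place clips side by side along width: (T, C, H, W_a + W_b)
        return torch.cat([clip_a, clip_b], dim=3)
    raise ValueError(f"unknown mode: {mode}")

# Example: two 49-frame 480x720 clips become one 98-frame sequence,
# which is then described by a single joint caption in step (ii).
a = torch.randn(49, 3, 480, 720)
b = torch.randn(49, 3, 480, 720)
joint = concat_clips(a, b, mode="time")  # shape: (98, 3, 480, 720)
```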