Real-world videos consist of sequences of events. Generating such sequences with precise temporal control is infeasible with existing video generators that rely on a single paragraph of text as input. When tasked with generating multiple events described in a single prompt, such methods often ignore some of the events or fail to arrange them in the correct order. To address this limitation, we present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. To enable time-aware interactions between event captions and video tokens, we design a time-based positional encoding method, dubbed ReRoPE, which guides the cross-attention operation. By fine-tuning a pre-trained video diffusion transformer on temporally grounded data, our approach produces coherent videos with smoothly connected events. For the first time in the literature, our model offers control over the timing of events in generated videos. Extensive experiments demonstrate that MinT outperforms existing open-source models by a large margin.
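The abstract does not spell out the exact form of ReRoPE. As a rough illustration only, the sketch below assumes it builds on standard rotary position embeddings (RoPE) and rescales each event caption's token positions onto the time span of its bound video period, so that caption tokens and the frames of that period occupy a shared position range during cross-attention. All function names and details here are hypothetical, not MinT's actual implementation.

```python
import numpy as np

def rope(x: np.ndarray, pos: float, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE: rotate feature pairs of x by angles proportional to `pos`.

    x has an even last dimension; each pair (x[i], x[i + d/2]) is rotated
    by pos * freqs[i], so relative position is encoded in dot products.
    """
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)  # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def event_token_positions(n_tokens: int, start: float, end: float) -> np.ndarray:
    """Hypothetical rescaling step: spread a caption's token positions over
    its event's [start, end] time interval, so they align with the positions
    assigned to the video frames of that period."""
    return np.linspace(start, end, n_tokens)
```

Under this reading, a caption bound to, say, seconds 2.0-4.0 of the video would have its query/key vectors rotated with positions from `event_token_positions(n, 2.0, 4.0)`, biasing cross-attention toward frames in that same window.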