Text-to-video (T2V) generation remains challenging for long-horizon videos that contain multiple events. Inspired by the intrinsics of the diffusion process, we probe video diffusion transformers (DiTs) and uncover turning points in the DiT denoising trajectory at which the influence of the conditioning text shifts from global layout to fine-grained details. Building on this finding, we present TunerDiT, a simple yet effective progressive steering method for multi-event generation that requires no additional training. TunerDiT comprises two steering handles: (1) Event-Partitioned Masking, which enforces event boundaries while allowing cross-event transition bands; and (2) Cross-Event Prompt Fusion, which injects neighboring event semantics for late-stage refinement. We also contribute Meve, a self-curated prompt suite for benchmarking multi-event generation. Compared with other training-free methods, TunerDiT achieves state-of-the-art performance across 8 metrics and offers a tunable trade-off between video consistency and event separation. The improvement in text alignment grows with the event count, suggesting that the method may scale to videos with more events.
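To make the idea of event boundaries with transition bands concrete, the following is a minimal sketch of one way a frame-to-event mask could be built. It is not the TunerDiT implementation: the abstract does not specify how the mask is constructed, so the even split of frames across events, the `band` parameter (measured in frames), and the function name `event_partition_mask` are all assumptions made for illustration.

```python
def event_partition_mask(num_frames, num_events, band=0):
    """Hypothetical helper: mask[f][e] is True if frame f may attend to
    the prompt tokens of event e.

    Assumptions (not from the paper): frames are split evenly across
    events, and `band` widens each event's span by that many frames on
    both sides so neighboring events overlap near a boundary. `band` is
    expected to be smaller than the length of one event.
    """
    if num_events < 1 or num_frames < num_events:
        raise ValueError("need at least one frame per event")
    if band < 0:
        raise ValueError("band must be non-negative")

    # Event e owns the half-open frame range [starts[e], starts[e + 1]).
    starts = [(e * num_frames) // num_events for e in range(num_events + 1)]

    mask = []
    for f in range(num_frames):
        row = []
        for e in range(num_events):
            lo, hi = starts[e], starts[e + 1]
            # Inside the event's own span, or within `band` frames of it.
            row.append(lo - band <= f < hi + band)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # 8 frames, 2 events, 1-frame transition band around the boundary.
    for f, row in enumerate(event_partition_mask(8, 2, band=1)):
        print(f, row)
```

With `band=0` the mask is a hard partition, so each frame sees only its own event prompt. Increasing `band` lets frames near a boundary see both neighboring prompts, which is one plausible way to expose a trade-off between event separation and consistency across the transition.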