Recent advances in generative video models have enabled the creation of high-quality videos from natural language prompts. However, these models typically lack fine-grained temporal control: users cannot specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that temporally aligns visual concepts at inference time, without requiring retraining or additional supervision. TempoControl uses cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal pattern with a control signal (correlation), adjusting its strength where visibility is required (magnitude), and preserving semantic consistency (entropy). TempoControl provides precise temporal control while maintaining high video quality and diversity. We demonstrate its effectiveness across various applications, including temporal reordering of single and multiple objects, action timing, and audio-aligned video generation. Project page: https://shira-schiber.github.io/TempoControl/.
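To make the three principles concrete, the following is a minimal toy sketch of how a correlation, a magnitude, and an entropy term could be combined into one scalar objective for a single prompt token. The abstract does not state the formulas, so everything here is an assumption made for illustration: the function name `temporal_control_loss`, the weights `w_corr`, `w_mag`, `w_ent`, the use of Pearson correlation, and the specific form of each term. It should not be read as the method's actual loss.

```python
import math


def temporal_control_loss(attn, control, w_corr=1.0, w_mag=1.0, w_ent=0.1, eps=1e-8):
    """Hypothetical toy objective for one prompt token (not the paper's formula).

    attn:    list of T frames, each a flat list of non-negative attention values.
    control: list of T target values in [0, 1] (1 = concept should be visible).
    Returns (total, terms), where terms holds each unweighted term.
    """
    T = len(attn)
    if T == 0 or len(control) != T:
        raise ValueError("attn and control must have the same non-zero length")

    # Per-frame attention strength: mean attention over spatial positions.
    strength = [sum(frame) / len(frame) for frame in attn]

    # Correlation term (assumed form): 1 - Pearson correlation between the
    # per-frame strength and the control signal.
    s_mean = sum(strength) / T
    c_mean = sum(control) / T
    s_c = [s - s_mean for s in strength]
    c_c = [c - c_mean for c in control]
    dot = sum(a * b for a, b in zip(s_c, c_c))
    norm = math.sqrt(sum(a * a for a in s_c)) * math.sqrt(sum(b * b for b in c_c))
    loss_corr = 1.0 - dot / (norm + eps)

    # Magnitude term (assumed form): reward strength on frames where the
    # concept should be visible and penalize it on the remaining frames.
    loss_mag = sum((1.0 - 2.0 * c) * s for c, s in zip(control, strength)) / T

    # Entropy term (assumed form): mean spatial entropy of each frame's
    # normalized attention map; lower values mean more focused attention.
    entropies = []
    for frame in attn:
        total = sum(frame) + eps
        entropies.append(-sum((v / total) * math.log(v / total + eps) for v in frame))
    loss_ent = sum(entropies) / T

    terms = {"correlation": loss_corr, "magnitude": loss_mag, "entropy": loss_ent}
    total_loss = w_corr * loss_corr + w_mag * loss_mag + w_ent * loss_ent
    return total_loss, terms


if __name__ == "__main__":
    # The concept should appear only in the second half of an 8-frame clip.
    control = [0, 0, 0, 0, 1, 1, 1, 1]
    # Toy attention: a faint uniform background plus one strong pixel when active.
    attn = [[0.01] * 16 for _ in control]
    for t, c in enumerate(control):
        if c:
            attn[t][5] += 1.0
    total, terms = temporal_control_loss(attn, control)
    print(total, terms)
```

In an inference-time setting, a scalar of this kind would be computed from the model's cross-attention maps and minimized during sampling; how that optimization is carried out is not specified in the abstract and is left out of this sketch.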