Recent Video-to-Audio (V2A) methods have made remarkable progress and can now synthesize realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios and when visual cues are insufficient, as with small regions, off-screen sounds, or occluded or partially visible objects. In this paper, we propose FoleyDirector, a framework that, for the first time, enables precise temporal guidance in DiT-based V2A generation while preserving the base model's audio quality and allowing seamless switching between V2A generation and temporally controlled synthesis. FoleyDirector introduces Structured Temporal Scripts (STS), a set of captions corresponding to short temporal segments, to provide richer temporal information. The STS features are integrated via the Script-Guided Temporal Fusion Module, which employs Temporal Script Attention to fuse them coherently. To handle complex multi-event scenarios, we further propose Bi-Frame Sound Synthesis, which generates in-frame and out-of-frame audio in parallel and improves controllability. To support training and evaluation, we construct the DirectorSound dataset and introduce VGGSoundDirector and DirectorBench. Experiments demonstrate that FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, empowering users to act as Foley directors and advancing V2A toward more expressive and controllable generation.
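To make the idea of a script of captions tied to short temporal segments concrete, the following is a minimal sketch of one way such a script could be represented as data. The abstract does not specify the actual STS format, so the `ScriptSegment` type, its fields, the `captions_at` helper, and the example captions are all hypothetical names chosen for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScriptSegment:
    """One entry of a hypothetical temporal script: a caption for a time span."""
    start: float  # segment start in seconds (inclusive)
    end: float    # segment end in seconds (exclusive)
    caption: str  # text describing the sound in this span


def captions_at(script: list[ScriptSegment], t: float) -> list[str]:
    """Return the captions of every segment whose span covers time t."""
    return [seg.caption for seg in script if seg.start <= t < seg.end]


# Illustrative script with assumed one-second segments and made-up captions.
script = [
    ScriptSegment(0.0, 1.0, "footsteps on gravel"),
    ScriptSegment(1.0, 2.0, "door slams off-screen"),
    ScriptSegment(2.0, 3.0, "silence"),
]

print(captions_at(script, 1.5))
```

A per-segment representation like this keeps each caption attached to an explicit time span, which is the kind of information a purely visual conditioning signal may lack for off-screen or occluded sound sources.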