Creating high-quality sound effects from videos and text prompts requires precise alignment between the visual and audio domains, both semantically and temporally, along with step-by-step guidance for professional audio generation. However, current state-of-the-art video-guided audio generation models often fall short of producing high-quality audio for both general and specialized use cases. To address this challenge, we introduce a multi-stage, multi-modal, end-to-end generative framework with Chain-of-Thought-like (CoT-like) guidance learning, termed Chain-of-Perform (CoP). First, we employ a transformer-based network architecture designed to achieve CoP guidance, enabling the generation of both general and professional audio. Second, we implement a multi-stage training framework that follows step-by-step guidance to ensure the generation of high-quality sound effects. Third, we develop a video-guided CoP multi-modal dataset to support step-by-step sound effects generation. Evaluation results highlight the advantages of the proposed multi-stage CoP generative framework over state-of-the-art models across diverse datasets: on VGGSound, FAD improves from 0.79 to 0.74 (+6.33%) and CLIP score from 16.12 to 17.70 (+9.80%); on PianoYT-2h, SI-SDR improves from 1.98 dB to 3.35 dB (+69.19%) and MOS from 2.94 to 3.49 (+18.71%); on Piano-10h, SI-SDR improves from 2.22 dB to 3.21 dB (+44.59%) and MOS from 3.07 to 3.42 (+11.40%).