Visual and auditory perception are two crucial ways humans experience the world. Text-to-video generation has made remarkable progress over the past year, but the absence of harmonious audio in generated video limits its broader applications. In this paper, we propose Semantic and Temporal Aligned Video-to-Audio (STA-V2A), an approach that enhances audio generation from videos by extracting both local temporal and global semantic video features and combining these refined features with text as cross-modal guidance. To address information redundancy in videos, we propose an onset prediction pretext task for local temporal feature extraction and an attentive pooling module for global semantic feature extraction. To supplement the insufficient semantic information in videos, we propose a Latent Diffusion Model initialized with Text-to-Audio priors and steered by cross-modal guidance. We also introduce Audio-Audio Align, a new metric for assessing audio-temporal alignment. Subjective and objective evaluations demonstrate that our method surpasses existing Video-to-Audio models in audio quality, semantic consistency, and temporal alignment. Ablation experiments validate the effectiveness of each module. Audio samples are available at https://y-ren16.github.io/STAV2A.
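To make the global-semantic feature extraction concrete, below is a minimal PyTorch sketch of a generic attentive pooling module of the kind the abstract describes: a small scoring network assigns a softmax-normalized weight to each video frame, and the weighted sum yields one global semantic vector. The class name, dimensions, and scoring MLP here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Generic attentive pooling: collapse per-frame features into one
    global vector via learned attention weights (illustrative sketch;
    the paper's exact module may differ)."""

    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Small MLP that produces one scalar relevance score per frame.
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim)
        weights = torch.softmax(self.score(frame_feats), dim=1)  # (B, T, 1)
        # Weighted sum over the temporal axis -> one global semantic vector.
        return (weights * frame_feats).sum(dim=1)  # (B, feat_dim)

# Usage: pool 32 frames of hypothetical 512-d frame features per clip.
pooler = AttentivePooling(feat_dim=512)
video_feats = torch.randn(4, 32, 512)  # (batch, frames, feature dim)
global_feat = pooler(video_feats)      # shape: (4, 512)
```

Compared with mean pooling, the learned weights let the model downweight uninformative frames, which is one plausible way such a module mitigates the information redundancy the abstract mentions.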