Generating sound effects for product-level videos, where only a small amount of labeled data is available across diverse scenes, requires producing high-quality sounds in few-shot settings. To tackle the challenge of limited labeled data in real-world scenes, we introduce YingSound, a foundation model for video-guided sound generation that supports high-quality audio generation in few-shot settings. YingSound consists of two major modules. The first module uses a conditional flow matching transformer to achieve effective semantic alignment between the audio and visual modalities during sound generation. This module builds a learnable audio-visual aggregator (AVA) that integrates high-resolution visual features with the corresponding audio features at multiple stages. The second module applies a proposed multi-modal visual-audio chain-of-thought (CoT) approach to generate finer sound effects in few-shot settings. Finally, we present an industry-standard video-to-audio (V2A) dataset that covers a variety of real-world scenarios. Through automated evaluations and human studies, we show that YingSound effectively generates high-quality, synchronized sounds across diverse conditional inputs. Project Page: \url{https://giantailab.github.io/yingsound/}
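To make the first module's training objective concrete: conditional flow matching regresses a velocity field along a probability path from noise to data. The sketch below is a minimal NumPy illustration of the standard optimal-transport CFM target, not YingSound's exact formulation; it assumes the audio latent is a plain vector and omits the transformer and the visual conditioning entirely, and the names (`cfm_training_pair`, `sigma_min`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_pair(x1, sigma_min=1e-4, rng=rng):
    """Build one conditional flow matching training example.

    x1: a target audio latent (a toy vector here); in the full model the
    network would additionally be conditioned on visual features.
    Returns (t, x_t, u_t): a sampled time, a point on the noise-to-data
    path, and the target velocity the network should regress to.
    """
    x0 = rng.standard_normal(x1.shape)            # noise sample
    t = rng.uniform()                             # time in [0, 1]
    # Optimal-transport path: linear interpolation from noise to data
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0             # target velocity field
    return t, x_t, u_t

# Toy usage: a zero "predictor" stands in for the transformer; the
# training loss is the MSE between its output and the target velocity.
x1 = rng.standard_normal(8)
t, x_t, u_t = cfm_training_pair(x1)
pred = np.zeros_like(u_t)
loss = float(np.mean((pred - u_t) ** 2))
```

At inference, one would integrate the learned velocity field from t = 0 (noise) to t = 1 with an ODE solver to produce the audio latent.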