Generating sound effects for product-level videos, where only a small amount of labeled data is available across diverse scenes, requires producing high-quality sounds in few-shot settings. To tackle the challenge of limited labeled data in real-world scenes, we introduce YingSound, a foundation model for video-guided sound generation that supports high-quality audio generation in few-shot settings. YingSound consists of two major modules. The first module uses a conditional flow matching transformer to achieve effective semantic alignment between the audio and visual modalities during sound generation. This module builds a learnable audio-visual aggregator (AVA) that integrates high-resolution visual features with the corresponding audio features at multiple stages. The second module applies a proposed multi-modal visual-audio chain-of-thought (CoT) approach to generate finer sound effects in few-shot settings. Finally, we present an industry-standard video-to-audio (V2A) dataset that covers a variety of real-world scenarios. Through automated evaluations and human studies, we show that YingSound effectively generates high-quality, synchronized sounds across diverse conditional inputs. Project Page: \url{https://giantailab.github.io/yingsound/}
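To make the first module's training objective concrete: conditional flow matching regresses a velocity field along a probability path from noise to data. The sketch below is a minimal NumPy illustration of the standard optimal-transport CFM target, not YingSound's exact formulation; it assumes the audio latent is a plain vector and omits the transformer and the visual conditioning entirely, and the names (`cfm_training_pair`, `sigma_min`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_pair(x1, sigma_min=1e-4, rng=rng):
    """Build one conditional flow matching training example.

    x1: a target audio latent (a toy vector here); in the full model the
    network would additionally be conditioned on visual features.
    Returns (t, x_t, u_t): a sampled time, a point on the noise-to-data
    path, and the target velocity the network should regress to.
    """
    x0 = rng.standard_normal(x1.shape)            # noise sample
    t = rng.uniform()                             # time in [0, 1]
    # Optimal-transport path: linear interpolation from noise to data
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0             # target velocity field
    return t, x_t, u_t

# Toy usage: a zero "predictor" stands in for the transformer; the
# training loss is the MSE between its output and the target velocity.
x1 = rng.standard_normal(8)
t, x_t, u_t = cfm_training_pair(x1)
pred = np.zeros_like(u_t)
loss = float(np.mean((pred - u_t) ** 2))
```

At inference, one would integrate the learned velocity field from t = 0 (noise) to t = 1 with an ODE solver to produce the audio latent.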