We introduce MMAudioSep, a generative model for video/text-queried sound separation built on a pretrained video-to-audio model. By leveraging the relationship between video/text and audio that the pretrained audio generative model has already learned, we can train the model more efficiently, since it does not need to be trained from scratch. We evaluate MMAudioSep against existing separation models, both deterministic and generative, and find that it outperforms these baselines. Furthermore, we demonstrate that even after acquiring sound separation functionality through fine-tuning, the model retains its original video-to-audio generation ability. This highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks. Our code is available at https://github.com/sony/mmaudiosep.