The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture, training process, and evaluation against other popular open models. Optimized for sound effects, the release provides (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows that each module performs competitively with or better than existing open alternatives such as StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.