The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture, its training process, and an evaluation against other popular open models. The release is optimized for sound effects and provides (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows that each module performs competitively with, or better than, existing open alternatives such as StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.