WavJourney: Compositional Audio Creation with Large Language Models

Despite breakthroughs in audio generation models, their capabilities are often confined to domain-specific conditions such as speech transcriptions and audio captions. However, real-world audio creation aims to generate harmonious audio containing various elements such as speech, music, and sound effects with controllable conditions, which is challenging to address using existing audio generation systems. We present WavJourney, a novel framework that leverages Large Language Models (LLMs) to connect various audio models for audio creation. WavJourney allows users to create storytelling audio content with diverse audio elements simply from textual descriptions. Specifically, given a text instruction, WavJourney first prompts LLMs to generate an audio script that serves as a structured semantic representation of audio elements. The audio script is then converted into a computer program, where each line of the program calls a task-specific audio generation model or computational operation function. The computer program is then executed to obtain a compositional and interpretable solution for audio creation. Experimental results suggest that WavJourney is capable of synthesizing realistic audio aligned with textually described semantic, spatial, and temporal conditions, achieving state-of-the-art results on text-to-audio generation benchmarks. Additionally, we introduce a new multi-genre story benchmark. Subjective evaluations demonstrate the potential of WavJourney in crafting engaging storytelling audio content from text. We further demonstrate that WavJourney can facilitate human-machine co-creation in multi-round dialogues. To foster future research, the code and synthesized audio are available at: https://audio-agi.github.io/WavJourney_demopage/.
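
The script-to-program step described above can be illustrated with a minimal sketch. It is not the WavJourney implementation. The script fields (`audio_type`, `layout`, `text`, `desc`, `seconds`) and the functions `tts`, `ttm`, `tta`, `concat`, and `mix` are hypothetical names chosen for illustration, the script is hard-coded rather than produced by an LLM, and the generators are stubs that return silence instead of calling real audio models.

```python
import json

SAMPLE_RATE = 100  # toy sample rate so the stub buffers stay tiny

# Hypothetical audio script, shaped like JSON an LLM might return.
SCRIPT_JSON = """
[
  {"audio_type": "music", "layout": "background", "desc": "calm piano", "seconds": 3},
  {"audio_type": "speech", "layout": "foreground", "text": "Once upon a time."},
  {"audio_type": "sound_effect", "layout": "foreground", "desc": "door creaks", "seconds": 1}
]
"""

# --- Stub "models" and operations (assumptions, not real model calls) ---
def tts(text):
    # Stub text-to-speech: 0.5 s of silence per word.
    return [0.0] * int(0.5 * SAMPLE_RATE * len(text.split()))

def ttm(desc, seconds):
    # Stub text-to-music: silence of the requested length.
    return [0.0] * (seconds * SAMPLE_RATE)

def tta(desc, seconds):
    # Stub text-to-audio (sound effects): silence of the requested length.
    return [0.0] * (seconds * SAMPLE_RATE)

def concat(clips):
    # Play clips one after another.
    return [sample for clip in clips for sample in clip]

def mix(tracks):
    # Overlay tracks sample by sample; shorter tracks are treated as padded.
    n = max(len(t) for t in tracks)
    return [sum(t[i] for t in tracks if i < len(t)) for i in range(n)]

def compile_script(script):
    """Turn the audio script into program lines, one call per line."""
    lines, foreground, background = [], [], []
    for i, item in enumerate(script):
        name = f"a{i}"
        if item["audio_type"] == "speech":
            lines.append(f"{name} = tts({item['text']!r})")
        else:
            func = "ttm" if item["audio_type"] == "music" else "tta"
            lines.append(f"{name} = {func}({item['desc']!r}, seconds={item['seconds']})")
        target = foreground if item["layout"] == "foreground" else background
        target.append(name)
    # Foreground clips play in sequence; background clips are mixed under them.
    lines.append(f"fg = concat([{', '.join(foreground)}])")
    lines.append(f"out = mix([{', '.join(['fg'] + background)}])")
    return lines

def run_program(lines):
    # exec is acceptable for a sketch; do not run untrusted programs this way.
    env = {"tts": tts, "ttm": ttm, "tta": tta, "concat": concat, "mix": mix}
    exec("\n".join(lines), env)
    return env["out"]

program = compile_script(json.loads(SCRIPT_JSON))
audio = run_program(program)
print("\n".join(program))
print(len(audio) / SAMPLE_RATE, "seconds")
```

Because the intermediate program is plain text, each line can be inspected or edited before execution, which is one way to read the abstract's claim that the solution is compositional and interpretable.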
