WavJourney: Compositional Audio Creation with Large Language Models

Large Language Models (LLMs) have shown great promise in integrating diverse expert models to tackle intricate language and vision tasks. Despite their significance in advancing the field of Artificial Intelligence Generated Content (AIGC), their potential in intelligent audio content creation remains unexplored. In this work, we tackle the problem of creating audio content with storylines encompassing speech, music, and sound effects, guided by text instructions. We present WavJourney, a system that leverages LLMs to connect various audio models for audio content generation. Given a text description of an auditory scene, WavJourney first prompts LLMs to generate a structured script dedicated to audio storytelling. The audio script incorporates diverse audio elements, organized based on their spatio-temporal relationships. As a conceptual representation of audio, the audio script provides an interactive and interpretable rationale for human engagement. Afterward, the audio script is fed into a script compiler, converting it into a computer program. Each line of the program calls a task-specific audio generation model or computational operation function (e.g., concatenate, mix). The computer program is then executed to obtain an explainable solution for audio generation. We demonstrate the practicality of WavJourney across diverse real-world scenarios, including science fiction, education, and radio play. The explainable and interactive design of WavJourney fosters human-machine co-creation in multi-round dialogues, enhancing creative control and adaptability in audio production. WavJourney audiolizes the human imagination, opening up new avenues for creativity in multimedia content creation.
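The script-to-program step described above can be illustrated with a minimal sketch. This is not the WavJourney implementation: the script schema (`type`, `text`, `seconds`, `layout`), the instruction tuples, and the names `compile_script`, `run`, and `MODELS` are all hypothetical, and the "models" are stand-ins that return constant-valued sample lists rather than real audio. It only shows the idea that each line of the compiled program either calls a task-specific generator or a computational operation such as concatenate or mix.

```python
from typing import Callable, Dict, List

SAMPLE_RATE = 8  # deliberately tiny so the output is easy to inspect
Audio = List[float]


def fake_generator(level: float) -> Callable[[str, float], Audio]:
    """Stand-in for a task-specific audio model (assumption, not a real model)."""
    def generate(text: str, seconds: float) -> Audio:
        return [level] * int(seconds * SAMPLE_RATE)
    return generate


# Hypothetical registry of expert models, keyed by audio type.
MODELS: Dict[str, Callable[[str, float], Audio]] = {
    "speech": fake_generator(0.5),
    "music": fake_generator(0.1),
    "sound_effect": fake_generator(0.3),
}


def concatenate(clips: List[Audio]) -> Audio:
    """Play clips one after another."""
    out: Audio = []
    for clip in clips:
        out.extend(clip)
    return out


def mix(base: Audio, overlay: Audio, offset: int = 0) -> Audio:
    """Sum overlay onto base starting at sample `offset`, padding with silence."""
    length = max(len(base), offset + len(overlay))
    out = base + [0.0] * (length - len(base))
    for i, sample in enumerate(overlay):
        out[offset + i] += sample
    return out


def compile_script(script: List[dict]) -> List[tuple]:
    """Turn a structured script into a flat program, one instruction per line."""
    program: List[tuple] = []
    foreground, background = [], []
    for i, node in enumerate(script):
        name = f"clip{i}"
        program.append(("generate", name, node["type"], node["text"], node["seconds"]))
        (foreground if node["layout"] == "foreground" else background).append(name)
    program.append(("concatenate", "track", foreground))
    for name in background:
        program.append(("mix", "track", "track", name))
    return program


def run(program: List[tuple]) -> Audio:
    """Execute the program; every step is an inspectable instruction."""
    env: Dict[str, Audio] = {}
    for op, target, *args in program:
        if op == "generate":
            kind, text, seconds = args
            env[target] = MODELS[kind](text, seconds)
        elif op == "concatenate":
            env[target] = concatenate([env[name] for name in args[0]])
        elif op == "mix":
            env[target] = mix(env[args[0]], env[args[1]])
    return env["track"]


# A toy script: two foreground events in sequence, with music underneath.
script = [
    {"type": "speech", "text": "Welcome aboard.", "seconds": 1.0, "layout": "foreground"},
    {"type": "sound_effect", "text": "door closes", "seconds": 0.5, "layout": "foreground"},
    {"type": "music", "text": "ambient synth", "seconds": 2.0, "layout": "background"},
]
program = compile_script(script)
audio = run(program)
```

Because the program is an ordinary list of instructions, a user could edit the script (for example, lengthen the music or reorder the foreground events) and recompile, which is one way to picture the interactive, multi-round editing the abstract describes.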
