This paper investigates how well text-to-audio music generation models produce long-form music when the prompt changes over time, focusing on soundtrack generation for Tabletop Role-Playing Games (TRPGs). We introduce Babel Bardo, a system that uses Large Language Models (LLMs) to transform speech transcriptions into music descriptions, which in turn control a text-to-music model. We compared four versions of Babel Bardo in two TRPG campaigns: a baseline that uses the speech transcriptions directly as prompts, and three LLM-based versions that differ in how they generate music descriptions. We evaluated audio quality, story alignment, and transition smoothness. Results indicate that detailed music descriptions improve audio quality, while maintaining consistency across consecutive descriptions enhances story alignment and transition smoothness.
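The pipeline summarized above (speech transcription, then music description, then text-to-music generation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `soundtrack`, `baseline_describe`, `stub_llm_describe`, and `stub_generate` are hypothetical, and the stubs stand in for a real LLM and a real text-to-music model, neither of which is specified here.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces (assumptions, not from the paper):
# describe(transcript, previous_description) -> music description
# generate(description) -> audio segment
DescribeFn = Callable[[str, Optional[str]], str]
GenerateFn = Callable[[str], bytes]


def soundtrack(transcripts: List[str],
               describe: DescribeFn,
               generate: GenerateFn) -> List[bytes]:
    """Produce one audio segment per transcript, in order."""
    segments: List[bytes] = []
    previous: Optional[str] = None
    for transcript in transcripts:
        # The previous description is passed along so that a describer
        # may keep consecutive descriptions consistent with each other.
        description = describe(transcript, previous)
        segments.append(generate(description))
        previous = description
    return segments


def baseline_describe(transcript: str, previous: Optional[str]) -> str:
    """Baseline: use the raw transcript as the prompt, ignoring history."""
    return transcript


def stub_llm_describe(transcript: str, previous: Optional[str]) -> str:
    """Stand-in for an LLM call; a real system would query a model here."""
    if previous is None:
        return f"Music for: {transcript}"
    return f"Music for: {transcript}; continue from: {previous}"


def stub_generate(description: str) -> bytes:
    """Stand-in for a text-to-music model; returns placeholder bytes."""
    return description.encode("utf-8")
```

Swapping `baseline_describe` for `stub_llm_describe` changes only how prompts are built, which mirrors the comparison between the baseline and the LLM-based versions; the generation step stays the same.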