This paper introduces LLM2Fx-Tools, a multimodal tool-calling framework that generates executable sequences of audio effects (Fx-chains) for music post-production. LLM2Fx-Tools uses a large language model (LLM) to understand audio inputs, select audio effect types, determine their order, and estimate their parameters, guided by chain-of-thought (CoT) planning. We also present LP-Fx, a new instruction-following dataset with structured CoT annotations and tool calls for audio effects modules. Experiments show that LLM2Fx-Tools can infer an Fx-chain and its parameters from pairs of unprocessed and processed audio, enabled by autoregressive sequence modeling, tool calling, and CoT reasoning. We further validate the system in a style transfer setting, where audio effects information is extracted from a reference source and applied to new content. Finally, LLM-as-a-judge evaluation demonstrates that our approach generates appropriate CoT reasoning and responses for music production queries. To our knowledge, this is the first work to apply LLM-based tool calling to audio effects modules, enabling interpretable and controllable music production.