The field of AI-assisted music creation has made significant strides, yet existing systems often struggle to meet the demands of iterative and nuanced music production. These challenges include providing sufficient control over the generated content and supporting flexible, precise edits. This thesis tackles these issues by introducing a series of advancements that build progressively upon one another, enhancing the controllability and editability of text-to-music generation models.

First, we introduce Loop Copilot, a system that addresses the need for iterative refinement in music creation. Loop Copilot leverages a large language model (LLM) to coordinate multiple specialised AI models, enabling users to generate and refine music interactively through a conversational interface. Central to this system is the Global Attribute Table, which records and maintains key musical attributes throughout the iterative process, ensuring that modifications at any stage preserve the overall coherence of the music. While Loop Copilot excels at orchestrating the music creation process, it does not directly support detailed edits to the generated content.

To overcome this limitation, we present MusicMagus, a solution for editing AI-generated music. MusicMagus introduces a zero-shot text-to-music editing approach that allows specific musical attributes, such as genre, mood, and instrumentation, to be modified without retraining. By manipulating the latent space of pre-trained diffusion models, MusicMagus ensures that these edits are stylistically coherent and that non-targeted attributes remain unchanged. The system is particularly effective at maintaining the structural integrity of the music during edits, but it encounters challenges with more complex, real-world audio scenarios. ...