Diffusion-based Text-To-Music (TTM) models generate music corresponding to text descriptions. Typically, UNet-based diffusion models are conditioned on text embeddings produced either by a pre-trained large language model or by a cross-modal audio-language representation model. This work proposes a diffusion-based TTM in which the UNet is conditioned on both (i) a uni-modal language model (e.g., T5) via cross-attention and (ii) a cross-modal audio-language representation model (e.g., CLAP) via Feature-wise Linear Modulation (FiLM). The diffusion model is trained to exploit both a local text representation from T5 and a global representation from CLAP. Furthermore, we propose modifications that extract both global and local representations from T5 alone, through pooling mechanisms that we call mean pooling and self-attention pooling. This approach removes the need for an additional encoder (e.g., CLAP) to extract a global representation, thereby reducing the number of model parameters. Our results show that adding the CLAP global embeddings to the T5 local embeddings enhances text adherence (KL=1.47) compared to a baseline model relying solely on the T5 local embeddings (KL=1.54). Alternatively, extracting global text embeddings directly from the T5 local embeddings through the proposed mean pooling yields superior generation quality (FAD=1.89) at the cost of marginally inferior text adherence (KL=1.51) relative to the model conditioned on both CLAP and T5 text embeddings (FAD=1.94, KL=1.47). Our proposed solution is thus not only efficient but also compact in terms of the number of parameters required.
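To make the conditioning mechanisms concrete, here is a minimal NumPy sketch of the three ingredients named above: FiLM modulation of UNet features by a global embedding, mean pooling of local T5 token embeddings, and self-attention pooling with a learnable query. All shapes, projection matrices, and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(features, global_emb, w_gamma, w_beta):
    # FiLM: project the global embedding to a per-channel scale (gamma)
    # and shift (beta), then modulate the feature map channel-wise.
    gamma = global_emb @ w_gamma          # (channels,)
    beta = global_emb @ w_beta            # (channels,)
    return gamma * features + beta        # broadcasts over the time axis

def mean_pool(token_embs):
    # Global representation as the average of the local (per-token) embeddings.
    return token_embs.mean(axis=0)

def self_attention_pool(token_embs, w_query):
    # A learnable query scores each token; softmax weights give a weighted sum.
    scores = token_embs @ w_query                 # (tokens,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ token_embs                   # (dim,)

# Toy shapes (hypothetical): 8 tokens, 16-dim text embeddings,
# and a UNet feature map with 10 time frames and 4 channels.
tokens = rng.standard_normal((8, 16))
g_mean = mean_pool(tokens)                               # (16,)
g_attn = self_attention_pool(tokens, rng.standard_normal(16))
feats = rng.standard_normal((10, 4))
out = film(feats, g_mean,
           rng.standard_normal((16, 4)), rng.standard_normal((16, 4)))
```

Note how both pooling variants collapse the local token sequence into a single vector of the same dimensionality, which is why they can stand in for the CLAP global embedding without an additional encoder.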