Diffusion-based Text-To-Music (TTM) models generate music corresponding to text descriptions. Typically, UNet-based diffusion models are conditioned on text embeddings produced either by a pre-trained large language model or by a cross-modal audio-language representation model. This work proposes a diffusion-based TTM in which the UNet is conditioned on both (i) a uni-modal language model (e.g., T5) via cross-attention and (ii) a cross-modal audio-language representation model (e.g., CLAP) via Feature-wise Linear Modulation (FiLM). The diffusion model is trained to exploit both a local text representation from T5 and a global representation from CLAP. Furthermore, we propose modifications that extract both global and local representations from T5 alone, through pooling mechanisms that we call mean pooling and self-attention pooling. This approach removes the need for an additional encoder (e.g., CLAP) to extract a global representation, thereby reducing the number of model parameters. Our results show that adding the CLAP global embeddings to the T5 local embeddings enhances text adherence (KL=1.47) compared to a baseline model relying solely on the T5 local embeddings (KL=1.54). Alternatively, extracting global text embeddings directly from the T5 local embeddings through the proposed mean pooling yields superior generation quality (FAD=1.89) at the cost of marginally inferior text adherence (KL=1.51) relative to the model conditioned on both CLAP and T5 text embeddings (FAD=1.94, KL=1.47). Our proposed solution is thus not only efficient but also compact in terms of the number of parameters required.
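To make the conditioning mechanisms concrete, here is a minimal NumPy sketch of the three ingredients named above: FiLM modulation of UNet features by a global embedding, mean pooling of local T5 token embeddings, and self-attention pooling with a learnable query. All shapes, projection matrices, and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(features, global_emb, w_gamma, w_beta):
    # FiLM: project the global embedding to a per-channel scale (gamma)
    # and shift (beta), then modulate the feature map channel-wise.
    gamma = global_emb @ w_gamma          # (channels,)
    beta = global_emb @ w_beta            # (channels,)
    return gamma * features + beta        # broadcasts over the time axis

def mean_pool(token_embs):
    # Global representation as the average of the local (per-token) embeddings.
    return token_embs.mean(axis=0)

def self_attention_pool(token_embs, w_query):
    # A learnable query scores each token; softmax weights give a weighted sum.
    scores = token_embs @ w_query                 # (tokens,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ token_embs                   # (dim,)

# Toy shapes (hypothetical): 8 tokens, 16-dim text embeddings,
# and a UNet feature map with 10 time frames and 4 channels.
tokens = rng.standard_normal((8, 16))
g_mean = mean_pool(tokens)                               # (16,)
g_attn = self_attention_pool(tokens, rng.standard_normal(16))
feats = rng.standard_normal((10, 4))
out = film(feats, g_mean,
           rng.standard_normal((16, 4)), rng.standard_normal((16, 4)))
```

Note how both pooling variants collapse the local token sequence into a single vector of the same dimensionality, which is why they can stand in for the CLAP global embedding without an additional encoder.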