Text-to-music models allow users to generate near-realistic musical audio from textual commands. However, editing music audio remains challenging due to the conflicting desiderata of performing fine-grained alterations on the audio while maintaining a simple user interface. To address this challenge, we propose Audio Prompt Adapter (or AP-Adapter), a lightweight addition to pretrained text-to-music models. We utilize AudioMAE to extract features from the input audio, and construct attention-based adapters to feed these features into the internal layers of AudioLDM2, a diffusion-based text-to-music model. With 22M trainable parameters, AP-Adapter empowers users to harness both global (e.g., genre and timbre) and local (e.g., melody) aspects of music, using the original audio and a short text as inputs. Through objective and subjective studies, we evaluate AP-Adapter on three tasks: timbre transfer, genre transfer, and accompaniment generation. Additionally, we demonstrate its effectiveness on out-of-domain audio containing instruments unseen during training.