In this work, we investigate the personalization of text-to-music diffusion models in a few-shot setting. Motivated by recent advances in the computer vision domain, we are the first to explore the combination of pre-trained text-to-audio diffusers with two established personalization methods. We examine the effect of audio-specific data augmentation on overall system performance and assess different training strategies. For evaluation, we construct a novel dataset of prompts and music clips. We consider both embedding-based and music-specific metrics for quantitative evaluation, as well as a user study for qualitative evaluation. Our analysis shows that similarity metrics agree with user preferences and that current personalization approaches tend to learn rhythmic music constructs more easily than melody. The code, dataset, and example material of this study are publicly available to the research community.