We demonstrate that intermediate representations from a single foundation model can enhance a variety of music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, which improves performance across downstream tasks, including both understanding and generative tasks. We evaluate this approach on representative tasks: music tagging, music transcription, music source separation, and music mixing. Our results show that features extracted from a music foundation model provide valuable enhancements when training downstream task models, acting as a booster for those tasks. The approach benefits existing task-specific models and also supports music downstream tasks constrained by data scarcity, paving the way for more effective and accessible music processing solutions.