Music representation learning is notoriously difficult because of the complex, human-related concepts carried by sequences of numerical signals. To excavate a better MUsic SEquence Representation from labeled audio, we propose a novel text-supervised pre-training method, namely MUSER. MUSER adopts an audio-spectrum-text tri-modal contrastive learning framework, in which the text input can be any form of metadata, rendered through text templates, while the spectrum is derived from the audio sequence. Our experiments reveal that MUSER can be adapted to downstream tasks more flexibly than the current data-hungry pre-training method, and that it requires only 0.056% of the pre-training data to achieve state-of-the-art performance.
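As a rough illustration of the two ideas named above (text templates over metadata, and contrastive alignment across three modalities), the following minimal sketch may help. It is not the MUSER implementation: the template wording, the function names, the temperature value, and the choice to sum three pairwise symmetric InfoNCE losses are all assumptions made for illustration, since the abstract does not specify them.

```python
import math

# Hypothetical template wording; the abstract does not give the actual templates.
TEMPLATE = "music whose {field} is {value}"


def metadata_to_text(metadata):
    """Render a metadata dict as one text prompt using the template."""
    parts = [TEMPLATE.format(field=k, value=v) for k, v in sorted(metadata.items())]
    return ", ".join(parts)


def normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a stable log-sum-exp
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse - row[i]
    return total / len(logits)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; a[i] and b[i] are treated as a matched pair."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    logits = [
        [sum(x * y for x, y in zip(u, v)) / temperature for v in b]
        for u in a
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


def tri_modal_loss(audio, spectrum, text):
    """Assumed combination: sum of the three pairwise contrastive losses."""
    return (
        info_nce(audio, text)
        + info_nce(spectrum, text)
        + info_nce(audio, spectrum)
    )
```

In this sketch, embeddings that agree across the three modalities for each clip yield a loss near zero, while mismatched text embeddings raise it. In practice the embeddings would come from learned audio, spectrum, and text encoders rather than being supplied by hand.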