We present COCOLA (Coherence-Oriented Contrastive Learning for Audio), a contrastive learning method for musical audio representations that captures the harmonic and rhythmic coherence between samples. Our method operates at the level of the stems (or combinations of stems) that compose music tracks, and it enables objective evaluation of compositional music models on the task of accompaniment generation. We also introduce CompoNet, a new baseline for compositional music generation based on ControlNet \cite{zhang2023adding} that generalizes the tasks of MSDM, and we compare it quantitatively against MSDM using COCOLA. We release all models trained on public datasets containing separate stems (MUSDB18-HQ, MoisesDB, Slakh2100, and CocoChorales).