This paper addresses the problem of global tempo estimation in musical audio. Because annotating tempo is time-consuming and requires musical expertise, few publicly available data sources exist for training machine learning models for this task. To alleviate this issue, we propose a fully self-supervised approach that does not rely on any human-labeled data. Our method builds on the fact that generic (music) audio embeddings already encode a variety of properties, including information about tempo, which makes them easy to adapt to downstream tasks. Recent work in self-supervised tempo estimation aimed to learn a tempo-specific representation that was subsequently used to train a supervised classifier; in contrast, we reformulate the task as a binary classification problem: predicting whether a target track has the same tempo as a reference or a different one. Whereas the earlier approach still requires labeled training data for the final classification model, ours trains on arbitrary unlabeled music data combined with time-stretching and uses a small set of synthetically created reference samples to predict the final tempo. Evaluated against the state of the art, our approach achieves highly competitive performance when the constraint of finding the precise tempo octave is relaxed.
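The same/different formulation can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: the function names, the set of stretch rates, the dictionary of references keyed by BPM, and the octave-tolerant check (modelled on the commonly used "Accuracy 2" convention with a 4% tolerance and factors 2, 3, 1/2, 1/3). The learned classifier is represented only by a hypothetical callable `same_tempo_prob`, and the actual time-stretching of audio is left to an audio library.

```python
import random
from typing import Callable, Dict, Hashable, Sequence, Tuple


def sample_training_pair(
    rng: random.Random,
    rates: Sequence[float] = (0.8, 0.9, 1.0, 1.1, 1.25),  # assumed values
) -> Tuple[float, float, int]:
    """Choose stretch rates for two views of one unlabeled clip.

    Returns (rate_a, rate_b, label), where label is 1 if both views share
    the same tempo (equal rates) and 0 otherwise. No tempo annotation is
    needed, because the label follows from the rates alone.
    """
    rate_a = rng.choice(rates)
    same = rng.random() < 0.5
    if same:
        rate_b = rate_a
    else:
        rate_b = rng.choice([r for r in rates if r != rate_a])
    return rate_a, rate_b, int(same)


def predict_tempo(
    target: object,
    references: Dict[Hashable, object],
    same_tempo_prob: Callable[[object, object], float],
) -> Hashable:
    """Return the BPM key of the reference judged most likely to share
    the target's tempo. `same_tempo_prob` stands in for the trained
    binary classifier (hypothetical interface)."""
    return max(references, key=lambda bpm: same_tempo_prob(target, references[bpm]))


def octave_tolerant_match(
    estimate: float,
    truth: float,
    tol: float = 0.04,
    factors: Sequence[float] = (1.0, 2.0, 3.0, 0.5, 1.0 / 3.0),
) -> bool:
    """True if the estimate lies within `tol` of the true tempo or of one
    of its assumed octave-related multiples."""
    return any(abs(estimate - f * truth) <= tol * f * truth for f in factors)
```

In this sketch the final tempo is read off the best-matching reference, so the resolution of the estimate is bounded by the spacing of the reference tempi; how the references are actually synthesized and spaced is not stated in the abstract.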