Recent beat and downbeat tracking models (e.g., RNNs, TCNs, Transformers) output frame-level activations. We propose reframing this task as object detection, modeling beats and downbeats as temporal "objects." Adapting the FCOS detector from computer vision to 1D audio, we replace its original backbone with WaveBeat's temporal feature extractor and add a Feature Pyramid Network to capture multi-scale temporal patterns. The model predicts overlapping beat/downbeat intervals with confidence scores; non-maximum suppression (NMS) then selects the final predictions. This NMS step plays a role similar to that of dynamic Bayesian networks (DBNs) in traditional trackers, but is simpler and less heuristic. Evaluated on standard music datasets, our approach achieves competitive results, showing that object detection techniques can effectively model musical beats with minimal adaptation.
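To make the NMS post-processing step concrete, the following is a minimal sketch (not the authors' implementation) of greedy 1D non-maximum suppression over predicted beat intervals. It assumes each prediction is a (start_sec, end_sec, confidence) triple and that temporal intersection-over-union is the overlap criterion; the function names and the IoU threshold of 0.5 are illustrative assumptions.

```python
# Illustrative sketch of 1-D NMS for beat-interval predictions.
# Assumed prediction format: (start_sec, end_sec, confidence).

def interval_iou(a, b):
    """Temporal intersection-over-union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def nms_1d(predictions, iou_threshold=0.5):
    """Greedy NMS: keep the highest-confidence interval, drop heavy overlaps.

    predictions: list of (start_sec, end_sec, confidence) tuples.
    Returns the surviving tuples, highest confidence first.
    """
    remaining = sorted(predictions, key=lambda p: p[2], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [p for p in remaining
                     if interval_iou(best[:2], p[:2]) < iou_threshold]
    return kept

if __name__ == "__main__":
    # Three overlapping candidates around one beat, plus one distinct beat.
    preds = [(0.48, 0.58, 0.91), (0.50, 0.60, 0.85),
             (0.47, 0.57, 0.40), (1.00, 1.10, 0.88)]
    for start, end, conf in nms_1d(preds):
        print(f"beat [{start:.2f}, {end:.2f}] s, confidence {conf:.2f}")
```

Unlike a DBN, which decodes beat times jointly under a tempo/meter transition model, this step involves only a confidence sort and a single overlap threshold, which is what the abstract means by "simpler and less heuristic."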