Many deep learning models have achieved dominant performance on the offline beat tracking task. However, online beat tracking, in which only past and present input features are available, remains challenging. In this paper, we propose the BEAt tracking Streaming Transformer (BEAST), an online joint beat and downbeat tracking system based on the streaming Transformer. To handle online scenarios, BEAST applies contextual block processing in the Transformer encoder. Moreover, we adopt relative positional encoding in the attention layer of the streaming Transformer encoder to capture relative timing positions, which are critically important in music. In beat and downbeat tracking experiments on benchmark datasets in a low-latency scenario with maximum latency under 50 ms, BEAST achieves an F1-measure of 80.04% for beat tracking and 52.73% for downbeat tracking, a substantial improvement of about 5 and 13 percentage points, respectively, over the state-of-the-art online beat and downbeat tracking model.
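To make the link between block processing and latency concrete, the following is a minimal sketch of how a frame sequence could be cut into overlapping blocks, each with past context, a center segment, and a short look-ahead. The function names, the block sizes, and the 10 ms frame hop are illustrative assumptions rather than values taken from this work, and the sketch omits the context embedding that contextual block processing passes from one block to the next.

```python
from typing import List, Tuple


def split_into_blocks(
    num_frames: int, left: int, center: int, right: int
) -> List[Tuple[int, int, int, int]]:
    """Split frame indices into overlapping blocks (hypothetical helper).

    Each tuple is (block_start, center_start, center_end, block_end),
    with end indices exclusive. Only the center frames of a block
    produce outputs; left and right frames serve as context.
    """
    blocks = []
    for c_start in range(0, num_frames, center):
        c_end = min(c_start + center, num_frames)
        b_start = max(0, c_start - left)        # past context, clipped at 0
        b_end = min(num_frames, c_end + right)  # look-ahead, clipped at the end
        blocks.append((b_start, c_start, c_end, b_end))
    return blocks


def max_latency_ms(center: int, right: int, hop_ms: float) -> float:
    """Worst-case algorithmic latency under a simplified model.

    Assumption: the first center frame must wait for the remaining
    center frames and all look-ahead frames; compute time is ignored.
    """
    return (center - 1 + right) * hop_ms


# Assumed settings: 10 ms hop, 4 past frames, 2 center frames, 2 look-ahead frames.
blocks = split_into_blocks(num_frames=10, left=4, center=2, right=2)
latency = max_latency_ms(center=2, right=2, hop_ms=10.0)
```

Under these assumed settings the worst-case algorithmic latency is 30 ms, which would fit within a 50 ms budget. Enlarging the center segment or the look-ahead gives each block more context at the cost of higher latency.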