The neural semi-Markov Conditional Random Field (semi-CRF) framework has demonstrated promise for event-based piano transcription. In this framework, all events (notes or pedals) are represented as closed time intervals tied to specific event types. The neural semi-CRF approach requires an interval scoring matrix that assigns a score to every candidate interval. However, designing an efficient and expressive architecture for scoring intervals is not trivial. This paper introduces a simple method for scoring intervals using scaled inner product operations, resembling the way attention scores are computed in transformers. We show theoretically that, owing to the special structure that arises from encoding non-overlapping intervals, and under a mild condition, the inner product operations are expressive enough to represent an ideal scoring matrix that yields the correct transcription result. We then demonstrate that an encoder-only, non-hierarchical transformer backbone, operating only on a low-time-resolution feature map, is capable of transcribing piano notes and pedals with high accuracy and time precision. Experiments show that our approach achieves new state-of-the-art performance across all subtasks, in terms of the F1 measure, on the Maestro dataset.
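To make the scoring idea concrete, the following is a minimal sketch of how a scaled inner product could fill an interval scoring matrix for a single event type. It is an illustration under stated assumptions rather than the architecture used in this work: the per-frame `queries` and `keys` are hypothetical stand-ins for features produced by the transformer backbone, the function name `interval_scores` is invented for this example, and the choice to score the interval [i, j] as the inner product of the query at frame i with the key at frame j, divided by the square root of the feature dimension, mirrors attention scoring but is an assumption here.

```python
import math


def interval_scores(queries, keys):
    """Return a T x T matrix of scaled inner-product interval scores.

    scores[i][j] scores the closed interval [i, j] with i <= j.
    Entries with i > j are -inf, since an interval cannot end
    before it starts. (Hypothetical helper, not the paper's code.)
    """
    if len(queries) != len(keys):
        raise ValueError("queries and keys must cover the same frames")
    num_frames = len(queries)
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)  # same scaling as attention scores

    scores = [[-math.inf] * num_frames for _ in range(num_frames)]
    for i in range(num_frames):          # candidate onset frame
        for j in range(i, num_frames):   # candidate offset frame
            dot = sum(q * k for q, k in zip(queries[i], keys[j]))
            scores[i][j] = dot * scale
    return scores


# Toy features for 3 frames with dimension 4 (made-up values).
queries = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
keys = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
scores = interval_scores(queries, keys)
```

In this toy setup only the upper triangle of `scores` is meaningful, which reflects the fact that each candidate interval is identified by an ordered pair of frames. A semi-CRF decoder would then select a set of non-overlapping intervals from such a matrix; that step is omitted here.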