LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection

Music learners can benefit greatly from tools that accurately detect errors in their practice. Existing approaches typically compare audio recordings to music scores using heuristics or learnable models. This paper introduces LadderSym, a novel Transformer-based method for music error detection. LadderSym is guided by two key observations about state-of-the-art approaches: (1) late fusion limits inter-stream alignment and cross-modality comparison capability; and (2) reliance on score audio introduces ambiguity in the frequency spectrum, degrading performance in music with concurrent notes. To address these limitations, LadderSym introduces (1) a two-stream encoder with inter-stream alignment modules, which improves audio comparison capability and error detection F1 scores, and (2) a multimodal strategy that leverages both audio and symbolic scores by incorporating symbolic representations as decoder prompts, which reduces ambiguity and further improves F1 scores. We evaluate our method on the MAESTRO-E and CocoChorales-E datasets by measuring the F1 score for each note category. Compared to the previous state of the art, LadderSym more than doubles F1 for missed notes on MAESTRO-E (26.8% -> 56.3%) and improves extra note detection by 14.4 points (72.0% -> 86.4%). Similar gains are observed on CocoChorales-E. We also evaluate our models on real data that we curated. This work offers insights about comparison models that could inform sequence evaluation tasks in reinforcement learning, human skill assessment, and model evaluation. Code: https://github.com/ben2002chou/LadderSYM
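To make the evaluation metric concrete, the following is a minimal sketch of how a per-category F1 score might be computed from true-positive, false-positive, and false-negative counts, together with a check of the gains quoted above. The `Counts` class, the function names, and the counts in `example` are hypothetical and chosen only for illustration; this is not the authors' evaluation code, and the abstract does not specify how predicted notes are matched to reference notes.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical per-category tallies of matched and unmatched notes."""
    tp: int  # notes correctly flagged in this category
    fp: int  # notes flagged in this category but not labeled as such
    fn: int  # notes labeled in this category but not flagged


def f1_score(c: Counts) -> float:
    """Standard F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * c.tp + c.fp + c.fn
    return 0.0 if denom == 0 else 2 * c.tp / denom


def per_category_f1(counts: dict[str, Counts]) -> dict[str, float]:
    """Compute F1 independently for each note category."""
    return {name: f1_score(c) for name, c in counts.items()}


# Illustrative counts only; these are NOT results from the paper.
example = {
    "missed": Counts(tp=50, fp=30, fn=50),
    "extra": Counts(tp=80, fp=10, fn=20),
}
scores = per_category_f1(example)

# Arithmetic check of the gains quoted in the abstract (MAESTRO-E).
extra_gain_points = round(86.4 - 72.0, 1)  # absolute gain in F1 points
missed_ratio = 56.3 / 26.8                 # relative improvement factor

if __name__ == "__main__":
    for name, value in scores.items():
        print(f"{name}: F1 = {value:.3f}")
    print(f"extra-note gain: {extra_gain_points} points")
    print(f"missed-note ratio: {missed_ratio:.2f}x")
```

Reporting F1 per category rather than a single pooled score keeps a rare category, such as missed notes, from being masked by more frequent ones.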
