Automated piano performance evaluation traditionally relies on symbolic (MIDI) representations, which capture note-level information but miss the acoustic nuances that characterize expressive playing. I propose using pre-trained audio foundation models, specifically MuQ and MERT, to predict 19 perceptual dimensions of piano performance quality. Using synthesized audio from PercePiano MIDI files (rendered via Pianoteq), I compare audio and symbolic approaches under controlled conditions where both derive from identical source data. The best model, MuQ layers 9-12 with Pianoteq soundfont augmentation, achieves R^2 = 0.537 (95% CI: [0.465, 0.575]), a 55% relative improvement over the symbolic baseline (R^2 = 0.347). Statistical analysis confirms significance (p < 10^-25), with audio outperforming symbolic on all 19 dimensions. I validate the approach through cross-soundfont generalization (R^2 = 0.534 +/- 0.075), difficulty correlation with an external dataset (rho = 0.623), and multi-performer consistency analysis. Analysis of audio-symbolic fusion reveals high error correlation (r = 0.738), explaining why fusion provides minimal benefit: audio representations alone suffice. I release the complete training pipeline, pretrained models, and inference code.
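The fusion finding above (highly correlated errors imply little gain from combining models) can be illustrated with a toy numerical sketch. All data and quantities below are synthetic and purely illustrative; this is not the paper's pipeline or its actual results:

```python
import numpy as np

# Toy illustration (synthetic data): when two models share most of their
# error, averaging their predictions ("fusion") barely improves R^2.
rng = np.random.default_rng(0)
n = 2000
y = rng.normal(size=n)          # ground-truth perceptual score (synthetic)
shared = rng.normal(size=n)     # error component common to both models

audio_pred = y + 0.5 * shared + 0.2 * rng.normal(size=n)
symbolic_pred = y + 0.5 * shared + 0.4 * rng.normal(size=n)
fused_pred = 0.5 * (audio_pred + symbolic_pred)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

err_corr = np.corrcoef(y - audio_pred, y - symbolic_pred)[0, 1]
print(f"error correlation: {err_corr:.3f}")   # high, by construction
print(f"R^2 audio:    {r2(y, audio_pred):.3f}")
print(f"R^2 symbolic: {r2(y, symbolic_pred):.3f}")
print(f"R^2 fused:    {r2(y, fused_pred):.3f}")  # ~same as audio alone
```

Because the shared error component dominates both residuals, fusion cancels only the small independent noise terms, so the fused R^2 sits essentially at the level of the better single model, mirroring the pattern the abstract reports.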