Capturing the intricate and subtle variations of human expressiveness in music performance with computational approaches is challenging. In this paper, we propose a novel approach for reconstructing human expressiveness in piano performance with a multi-layer bi-directional Transformer encoder. Training neural networks for this task requires large amounts of accurately captured and score-aligned performance data; to address this need, we train our model on transcribed scores obtained from an existing transcription model. We integrate pianist identities to control the sampling process and explore the ability of our system to model variations in expressiveness across different pianists. The system is evaluated through a statistical analysis of the generated expressive performances and a listening test. Overall, the results suggest that our method achieves state-of-the-art performance in generating human-like piano performances from transcribed scores, although fully and consistently reconstructing human expressiveness remains challenging.