Expressive performance rendering (EPR) aims to generate realistic performances conditioned on sequences of notes. However, flow matching audio editing models manipulate only synchronized music samples of the same duration, which limits their ability to model expressive timing. We introduce PianoKontext, a flow matching rendering model for classical piano music that generates variable-length performances in the latent space of a pretrained Music2Latent model. We synthesize MIDI scores into deadpan audio and apply Dynamic Time Warping (DTW) in the latent space to construct paired training data. The aligned embeddings are concatenated in DiT blocks, allowing the model to learn the dependencies between the score and the performance simply and effectively. Audio samples are available at our demo page: https://realfolkcode.github.io/pianokontext_demo/.
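As a rough illustration of the alignment step, the sketch below runs classic DTW between two sequences of latent frames and then stretches the score sequence to the length of the performance. It is a minimal sketch under stated assumptions, not the authors' implementation: the Euclidean frame distance, the standard three-way step pattern, the "last aligned frame" warping rule, and the names `dtw_path` and `warp_to_target` are all hypothetical, and small toy arrays stand in for Music2Latent embeddings.

```python
import numpy as np


def dtw_path(x, y):
    """Align x (n, d) with y (m, d); return a monotonic list of (i, j) pairs."""
    n, m = len(x), len(y)
    # Pairwise Euclidean distance between frames (assumed metric).
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    # Accumulated cost with an infinite border so edges need no special cases.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the last cell to the first one.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    return path[::-1]


def warp_to_target(x, path, m):
    """For each target frame j, pick the last source frame aligned to it."""
    idx = np.zeros(m, dtype=int)
    for i, j in path:
        idx[j] = i  # later pairs overwrite earlier ones
    return x[idx]


# Toy stand-ins: a 3-frame "score" and a 5-frame "performance" that
# lingers on the first and last frames.
score = np.array([[0.0], [1.0], [2.0]])
perf = np.array([[0.0], [0.0], [1.0], [2.0], [2.0]])

path = dtw_path(score, perf)
warped = warp_to_target(score, path, len(perf))
print(path)
print(warped.shape)  # same number of frames as the performance
```

After warping, the score latents and performance latents have the same number of frames, so they can be paired frame by frame. How the paired sequences are actually combined inside the DiT blocks is not detailed in the abstract and is not modeled here.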