Expressive music synthesis (EMS) for violin performance is a challenging task because performers disagree on the interpretation of expressive musical terms (EMTs), labeled recordings are scarce, and synthesis models have limited generalization ability. These challenges create trade-offs among model effectiveness, diversity of generated results, and controllability of the synthesis system, making a comparative study of EMS model design essential. This paper explores two violin EMS approaches. The end-to-end approach is a modification of a state-of-the-art text-to-speech generator. The parameter-controlled approach is based on a simple parameter sampling process that can render note lengths and other parameters compatible with MIDI-DDSP. We study these two approaches (three model variants in total) through objective and subjective experiments and discuss several key issues of EMS based on the results.