Automatically generating realistic musical performance motion can greatly enhance digital media production, which often requires collaboration between media professionals and musicians. However, capturing the intricate body, hand, and finger movements required for accurate musical performance is challenging. Existing methods often fall short due to the complex mapping between audio and motion, and typically require additional inputs such as scores or MIDI data. In this work, we present SyncViolinist, a multi-stage end-to-end framework that generates synchronized violin performance motion solely from audio input. Our method captures both global and fine-grained performance features through two key modules: a bowing/fingering module and a motion generation module. The bowing/fingering module extracts detailed playing information from the audio, which the motion generation module uses to create precise, coordinated body motions reflecting the temporal granularity and nature of the violin performance. We demonstrate the effectiveness of SyncViolinist with significantly improved qualitative and quantitative results on unseen violin performance audio, outperforming state-of-the-art methods. Extensive subjective evaluations involving professional violinists further validate our approach. The code and dataset are available at https://github.com/Kakanat/SyncViolinist.
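The two-stage pipeline described above can be sketched in miniature. This is a hypothetical illustration only: the function names, features, and the placeholder linear mapping are our own assumptions, standing in for the learned networks in the actual SyncViolinist framework.

```python
import numpy as np

def extract_bowing_fingering(audio, frame_len=1024):
    """Hypothetical stand-in for the bowing/fingering module: derives
    per-frame playing features from raw audio. The real system uses
    learned models; here we use a crude energy feature and a toy
    bow-direction proxy purely for illustration."""
    n_frames = len(audio) // frame_len
    frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = frames.mean(axis=1)                                # per-frame energy
    bow_dir = np.sign(np.diff(energy, prepend=energy[0]))       # toy direction proxy
    return np.stack([energy, bow_dir], axis=1)                  # (n_frames, 2)

def generate_motion(playing_features, n_joints=24):
    """Hypothetical stand-in for the motion generation module: maps
    per-frame playing features to per-frame joint parameters. A fixed
    random linear map replaces the learned generator."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((playing_features.shape[1], n_joints * 3))
    return playing_features @ W                                 # (n_frames, n_joints * 3)

# 1 second of dummy "audio" at 44.1 kHz
audio = np.sin(np.linspace(0, 200 * np.pi, 44100))
feats = extract_bowing_fingering(audio)
motion = generate_motion(feats)
print(feats.shape, motion.shape)
```

The key design point the sketch mirrors is the intermediate representation: rather than mapping audio to motion directly, the audio is first translated into explicit playing information, which the second stage conditions on.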