Pitch shifting has long been an essential feature in singing voice production. However, conventional signal processing approaches exhibit well-known trade-offs, such as formant shifts and robotic coloration, that become more severe at larger transposition intervals. This paper targets high-quality pitch shifting for singing by reframing it as a restoration problem: given an audio track that has been pitch-shifted (and thus contaminated by artifacts), we recover a natural-sounding performance while preserving its melody and timing. Specifically, we use a lightweight mel-space diffusion model conditioned on frame-level acoustic features such as f0, volume, and content features. We construct training pairs in a self-supervised manner by applying pitch shifts and then reversing them, which simulates realistic artifacts while retaining the ground truth. On a curated singing dataset, the proposed approach substantially reduces pitch-shifting artifacts compared with representative classical baselines, as measured by both statistical metrics and pairwise acoustic measures. The results suggest that restoration-based pitch shifting could be a viable approach toward artifact-resistant transposition in vocal production workflows.
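The self-supervised pair construction described above (shift, then reverse the shift, so the round trip injects artifacts while the clean original remains available as the target) can be sketched as follows. This is a minimal illustration, not the paper's pipeline: it uses a deliberately naive resampling-based shifter (`naive_pitch_shift` and `make_training_pair` are hypothetical helper names), whereas a real setup would use a production pitch shifter to generate realistic artifacts.

```python
import numpy as np

def naive_pitch_shift(y: np.ndarray, semitones: float) -> np.ndarray:
    # Illustrative resampling-based shift: reading the waveform at rate
    # 2^(s/12) raises the pitch by s semitones, and linear interpolation
    # back onto the original sample grid introduces exactly the kind of
    # interpolation/formant artifacts a restoration model must remove.
    rate = 2.0 ** (semitones / 12.0)
    n = len(y)
    read_pos = np.clip(np.arange(n) * rate, 0, n - 1)
    return np.interp(read_pos, np.arange(n), y)

def make_training_pair(y: np.ndarray, semitones: float = 4.0):
    # Self-supervised pair: shift up then back down. The round trip keeps
    # melody and timing aligned with the original, so the clean signal y
    # can serve as the ground-truth target for the artifact-laden input.
    degraded = naive_pitch_shift(naive_pitch_shift(y, semitones), -semitones)
    return degraded, y  # (model input, ground-truth target)
```

Because the forward and inverse shifts cancel in pitch and timing, the degraded signal stays frame-aligned with the target, which is what makes a conditional restoration model trainable on such pairs.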