The speech-to-singing (STS) voice conversion task aims to generate singing samples that correspond to speech recordings, but it faces a major challenge: the alignment between the target (singing) pitch contour and the source (speech) content is difficult to learn in a text-free setting. This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment, which treats speech variance such as pitch and content as different modalities. Inspired by how humans sing lyrics to a melody, AlignSTS: 1) adopts a novel rhythm adaptor to predict the target rhythm representation, bridging the modality gap between content and pitch, where the rhythm representation is computed in a simple yet effective way and is quantized into a discrete space; and 2) uses the predicted rhythm representation to re-align the content via cross-attention and performs cross-modal fusion for re-synthesis. Extensive experiments show that AlignSTS achieves superior performance on both objective and subjective metrics. Audio samples are available at https://alignsts.github.io.
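The cross-attention re-alignment in step 2 can be illustrated with a minimal NumPy sketch. It is not the AlignSTS implementation. It assumes that the predicted rhythm representation acts as the query sequence (one vector per target singing frame) and the speech content features act as keys and values, using plain scaled dot-product attention without learned projections. The function name `cross_attention_realign`, the feature dimension, and the sequence lengths are all hypothetical choices for illustration.

```python
import numpy as np


def cross_attention_realign(rhythm_query, content):
    """Re-align content frames to the target timeline (illustrative only).

    rhythm_query: (T_target, d) array, assumed to come from a rhythm predictor.
    content:      (T_source, d) array of speech content features.
    Returns the re-aligned content (T_target, d) and the attention
    weights (T_target, T_source).
    """
    d = rhythm_query.shape[-1]
    # Scaled dot-product scores between each target frame and each source frame.
    scores = rhythm_query @ content.T / np.sqrt(d)
    # Numerically stable softmax over the source axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each target frame becomes a weighted mix of source content frames.
    return weights @ content, weights


# Hypothetical sizes: 50 speech frames stretched to 120 singing frames.
rng = np.random.default_rng(0)
content = rng.standard_normal((50, 16))
rhythm_query = rng.standard_normal((120, 16))
realigned, attn = cross_attention_realign(rhythm_query, content)
```

Here the output has one row per target frame, so the content sequence takes on the length of the singing timeline regardless of how long the source speech is. That is the property the re-alignment step relies on.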