The speech-to-singing (STS) voice conversion task aims to generate singing samples that correspond to speech recordings. It faces a major challenge: the alignment between the target (singing) pitch contour and the source (speech) content is difficult to learn in a text-free setting. This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment, which treats speech variance information such as pitch and content as different modalities. Inspired by how humans sing lyrics to a melody, AlignSTS: 1) adopts a novel rhythm adaptor that predicts the target rhythm representation to bridge the modality gap between content and pitch, where the rhythm representation is computed in a simple yet effective way and is quantized into a discrete space; and 2) uses the predicted rhythm representation to re-align the content via cross-attention and performs a cross-modal fusion for re-synthesis. Extensive experiments show that AlignSTS achieves superior performance on both objective and subjective metrics. Audio samples are available at https://alignsts.github.io.
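The two steps above can be illustrated with a toy NumPy sketch. It is not the AlignSTS implementation: the abstract does not specify how the rhythm representation is computed, so the per-frame energy cue, the bin count, the random embedding tables, and the additive fusion are all assumptions made for illustration, and the names `quantize_rhythm` and `cross_attention` are hypothetical. The sketch only shows the general pattern of quantizing a rhythm cue into discrete tokens and using those tokens as cross-attention queries over speech content so that the content is re-timed to the target length.

```python
import numpy as np

def quantize_rhythm(cue, num_bins=8):
    """Map a per-frame scalar cue to integer bin indices in [0, num_bins - 1].

    Assumption: a uniform quantizer over the cue's range; the paper's actual
    rhythm representation is not described in the abstract.
    """
    edges = np.linspace(cue.min(), cue.max(), num_bins + 1)[1:-1]
    return np.digitize(cue, edges)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: one weighted mix of values per query frame."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
dim, num_bins = 16, 8
speech_frames, singing_frames = 40, 60

# Random stand-ins for learned features (assumptions, not real data).
content = rng.normal(size=(speech_frames, dim))    # source speech content
target_cue = rng.random(singing_frames)            # predicted target rhythm cue
pitch_emb = rng.normal(size=(singing_frames, dim)) # target pitch embedding

# Step 1: discrete rhythm tokens, looked up in a hypothetical embedding table.
rhythm_ids = quantize_rhythm(target_cue, num_bins)
rhythm_table = rng.normal(size=(num_bins, dim))
rhythm_emb = rhythm_table[rhythm_ids]

# Step 2: rhythm queries attend over content, re-timing it to the target length.
aligned_content, weights = cross_attention(rhythm_emb, content, content)

# Simplest possible fusion (assumption): add the aligned content to the pitch.
fused = aligned_content + pitch_emb
```

Here `aligned_content` has one row per target (singing) frame even though the source has fewer frames, which is the re-alignment effect the abstract describes; a trained model would learn the embeddings and projections instead of sampling them.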