This paper proposes a speech rhythm-based speaker embedding method that models phoneme duration from only a few utterances by the target speaker. Alongside acoustic features such as F0, speech rhythm is an essential speaker characteristic for reproducing an individual's way of speaking in speech synthesis. The novel feature of the proposed method is that the rhythm-based embeddings are extracted from phonemes and their durations, which are known to be related to speaking rhythm. The embeddings are extracted with a speaker identification model similar to the conventional spectral feature-based one. To evaluate performance, we conducted three experiments: speaker embedding generation, speech synthesis with the generated embeddings, and embedding space analysis. The proposed method achieved moderate speaker identification performance (15.2% equal error rate, EER) even though it uses only phonemes and their durations. Objective and subjective evaluations showed that the proposed method synthesizes speech whose rhythm is closer to the target speaker's than that of the conventional method. We also visualized the embeddings to examine the relationship between embedding distance and perceptual similarity. The visualization of the embedding space and the analysis of closeness relations indicated that the distribution of embeddings reflects both subjective and objective similarity.
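For readers unfamiliar with the equal error rate (EER) quoted for speaker identification, the following is a minimal sketch of how such a figure can be computed from verification trial scores. It is not the authors' evaluation code. The function name `equal_error_rate`, the score lists, and the convention that a higher score means "same speaker" are all assumptions made for illustration.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Return (eer, threshold) for similarity scores.

    Assumption (not from the paper): a trial is accepted as "same speaker"
    when its score is greater than or equal to the threshold.
    """
    # Candidate thresholds: every distinct score observed in either set.
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best = None
    for t in thresholds:
        # False acceptance rate: impostor trials wrongly accepted.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # False rejection rate: genuine trials wrongly rejected.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        # Keep the first threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy scores, made up purely for illustration.
genuine = [0.9, 0.8, 0.7, 0.3]
impostor = [0.6, 0.4, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

The EER is the operating point at which false acceptances and false rejections are equally frequent, so a lower value indicates better speaker discrimination. With a finite set of trials the two rates rarely coincide exactly, which is why this sketch averages them at the closest threshold.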