Voice conversion aims to transform source speech into a different target voice. However, typical voice conversion systems do not account for rhythm, which is an important factor in the perception of speaker identity. To bridge this gap, we introduce Urhythmic, an unsupervised method for rhythm conversion that does not require parallel data or text transcriptions. Using self-supervised representations, we first divide the source audio into segments approximating sonorants, obstruents, and silences. We then model rhythm by estimating either the speaking rate or the duration distribution of each segment type. Finally, we match the target speaking rate or rhythm by time-stretching the speech segments. Experiments show that Urhythmic outperforms existing unsupervised methods in terms of quality and prosody. Code and checkpoints: https://github.com/bshall/urhythmic. Audio demo page: https://ubisoft-laforge.github.io/speech/urhythmic.
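To make the speaking-rate variant of the pipeline concrete, the following is a minimal sketch of the "estimate rate, then time-stretch" step operating on segment durations only. It is not the Urhythmic implementation: the `Segment` type, the function names, the use of sonorant count per second of non-silent speech as a rate proxy, and the choice to leave silences unstretched are all assumptions made for illustration. The segmentation itself and the actual audio time-stretching are out of scope here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical segment record; labels follow the three types in the abstract."""
    label: str       # "sonorant", "obstruent", or "silence"
    duration: float  # seconds


def speaking_rate(segments: List[Segment]) -> float:
    """Assumed rate proxy: sonorant segments per second of non-silent speech."""
    speech = [s for s in segments if s.label != "silence"]
    total = sum(s.duration for s in speech)
    if total <= 0:
        raise ValueError("no non-silent speech to estimate a rate from")
    n_sonorants = sum(1 for s in speech if s.label == "sonorant")
    return n_sonorants / total


def stretch_to_rate(source: List[Segment], target_rate: float) -> List[Segment]:
    """Scale non-silent durations uniformly so the source rate matches target_rate.

    Silences are left unchanged (an assumption of this sketch).
    """
    if target_rate <= 0:
        raise ValueError("target_rate must be positive")
    # ratio < 1 shortens segments (faster speech); ratio > 1 lengthens them.
    ratio = speaking_rate(source) / target_rate
    return [
        s if s.label == "silence" else Segment(s.label, s.duration * ratio)
        for s in source
    ]


# Example: a slow source utterance matched to a faster target rate.
source = [
    Segment("sonorant", 0.2),
    Segment("obstruent", 0.1),
    Segment("silence", 0.3),
    Segment("sonorant", 0.2),
    Segment("obstruent", 0.1),
]
stretched = stretch_to_rate(source, target_rate=5.0)
```

Because every non-silent duration is scaled by the same ratio, the stretched sequence has exactly the target rate under this proxy. Matching a per-type duration distribution instead, the finer-grained option mentioned in the abstract, would need a separate mapping for each segment type rather than a single global ratio.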