In healthy-to-pathological voice conversion (H2P-VC), healthy speech is converted into pathological speech while preserving the speaker's identity. This paper improves on a previous two-stage approach to H2P-VC in which (1) speech with the appropriate severity is created first, and (2) the speaker identity of the voice is then converted while preserving its severity. Specifically, we propose improvements to stage (2) using phonetic posteriorgrams (PPG) and global style tokens (GST). Furthermore, we present a new dataset containing parallel recordings of pathological and healthy speakers with the same identity, which allows more precise evaluation. Listening tests by expert listeners show that the framework preserves the severity of the source sample while modelling the target speaker's voice. We also show that (a) pathology impacts x-vectors, but not all speaker information is lost, and (b) choosing source speakers based on severity labels alone is insufficient.