This paper describes a human-in-the-loop approach to personalized voice synthesis in the absence of reference speech data from the target speaker. It is intended to help vocally disabled individuals restore their lost voices without requiring any prior recordings. The proposed approach leverages a learned speaker embedding space: starting from an initial voice, users iteratively refine the speaker embedding parameters through a coordinate-descent-like process guided by auditory perception. An analysis of the latent space shows that the embedding parameters correspond to perceptual voice attributes, including pitch, vocal tension, brightness, and nasality, which makes the search process intuitive. Computer simulations and real-world user studies demonstrate that the proposed approach effectively approximates target voices across a diverse range of test cases.
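To make the search procedure concrete, the following is a minimal sketch of a coordinate-descent-like refinement over a speaker embedding. The embedding dimensionality, step schedule, and the `preference` oracle are assumptions for illustration only: in the actual system the preference judgment comes from a human listener comparing synthesized audio by ear, which is simulated here with a distance to a hidden target embedding.

```python
import numpy as np

# Hypothetical illustration of the coordinate-descent-like search; the
# embedding dimensionality and step schedule are assumptions, not values
# from the paper.
rng = np.random.default_rng(0)
DIM = 16                        # assumed speaker-embedding dimensionality
target = rng.normal(size=DIM)   # hidden target voice (unknown to the searcher)

def preference(candidate, current):
    """Simulated listener: prefers whichever embedding is closer to the
    target. In deployment this is a human A/B judgment by ear."""
    return np.linalg.norm(candidate - target) < np.linalg.norm(current - target)

emb = rng.normal(size=DIM)      # initial voice
step = 1.0
for sweep in range(20):         # repeated coordinate sweeps
    for d in range(DIM):        # adjust one perceptual axis (e.g. pitch) at a time
        for delta in (+step, -step):
            cand = emb.copy()
            cand[d] += delta
            if preference(cand, emb):  # keep the change the listener prefers
                emb = cand
                break
    step *= 0.8                 # shrink the step as the voice converges

print("final distance to target:", np.linalg.norm(emb - target))
```

Because each update perturbs a single embedding coordinate, and the analysis above associates individual coordinates with perceptual attributes, every comparison the user makes reduces to a simple one-attribute judgment such as "higher or lower pitch".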