This paper presents a user-driven approach for synthesizing specific target voices based on user feedback rather than reference recordings, which is particularly beneficial for speech-impaired individuals who want to recreate their lost voices but lack prior recordings. Our method leverages the neural analysis and synthesis framework to construct a latent speaker embedding space. Within this latent space, a human-in-the-loop search algorithm guides the voice generation process. Users participate in a series of straightforward listening-and-comparison tasks, providing feedback that iteratively refines the synthesized voice to match their desired target. Both computer simulations and real-world user studies demonstrate that the proposed approach can effectively approximate target voices. Moreover, by analyzing the mel-spectrogram generator's Jacobians, we identify a set of meaningful voice editing directions within the latent space. These directions enable users to further fine-tune specific attributes of the generated voice, including pitch level, pitch range, volume, vocal tension, nasality, and tone color.
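The listening-and-comparison loop can be sketched as a comparison-driven hill climb in the embedding space. Everything below is illustrative, not the authors' implementation: the embedding dimensionality, the identity `synthesize` stand-in for the generator, and the distance-based `simulated_user` oracle (playing the role of the computer-simulated listener) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # hypothetical speaker-embedding dimensionality

def synthesize(z):
    """Placeholder for the mel-spectrogram generator; identity keeps the sketch runnable."""
    return z

def simulated_user(a, b, target):
    """Oracle standing in for a listening test: True if sample `a` is closer to the target."""
    return np.linalg.norm(a - target) < np.linalg.norm(b - target)

def hitl_search(target, z0, n_rounds=200, step=0.5):
    """Keep whichever of (current voice, candidate voice) the user prefers each round."""
    z = z0.copy()
    for _ in range(n_rounds):
        candidate = z + step * rng.standard_normal(DIM)
        if simulated_user(synthesize(candidate), synthesize(z), target):
            z = candidate          # user preferred the candidate voice
        step *= 0.995              # gently anneal the search radius
    return z

target = rng.standard_normal(DIM)  # embedding of the (unavailable) target voice
z0 = rng.standard_normal(DIM)      # random starting voice
z_found = hitl_search(target, z0)
```

Because a candidate is accepted only when the simulated listener prefers it, the distance to the target decreases monotonically over the accepted steps, mirroring the iterative refinement described above.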
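The Jacobian analysis that yields editing directions can be illustrated with a toy generator: the right singular vectors of the generator's Jacobian at a point in the latent space are the directions ranked by how strongly they perturb the output spectrogram, which makes them natural candidates for attribute-editing axes. The linear-tanh generator and the dimensions below are placeholders under that assumption, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, OUT = 8, 80  # hypothetical embedding dim and mel-spectrogram feature dim
W = rng.standard_normal((OUT, DIM))

def generator(z):
    """Toy stand-in for the mel-spectrogram generator (fixed linear map + tanh)."""
    return np.tanh(W @ z)

def numerical_jacobian(f, z, eps=1e-5):
    """Central-difference Jacobian of f at z."""
    J = np.zeros((len(f(z)), len(z)))
    for i in range(len(z)):
        dz = np.zeros_like(z)
        dz[i] = eps
        J[:, i] = (f(z + dz) - f(z - dz)) / (2 * eps)
    return J

z = rng.standard_normal(DIM)
J = numerical_jacobian(generator, z)
# Right singular vectors of J, ordered by singular value, are latent directions
# sorted by how strongly they change the generated spectrogram.
_, s, Vt = np.linalg.svd(J, full_matrices=False)
edit_direction = Vt[0]  # direction of maximal local change
```

Moving the embedding a small step along `Vt[0]` changes the output far more than a step along the last singular vector, which is the property that makes such directions useful as editing handles.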