We address the problem of human-in-the-loop control of prosody generation in text-to-speech synthesis. Controlling prosody is challenging because existing generative models lack an efficient interface through which users can modify the output quickly and precisely. To solve this, we introduce a novel framework in which the user provides partial inputs and the generative model fills in the missing features. We propose a model that is specifically designed to encode partial prosodic features and output complete audio. We show empirically that our model displays two essential qualities of a human-in-the-loop control mechanism: efficiency and robustness. Even with a very small number of input values (about 4), our model enables users to significantly improve the quality of the output, as measured by listener preference (4:1).