In this paper, we propose reverse inference optimization (RIO), a simple and effective method designed to enhance the robustness of autoregressive-model-based zero-shot text-to-speech (TTS) systems using reinforcement learning from human feedback (RLHF). To assess the quality of speech produced by the TTS system without human annotations, RIO introduces a novel concept termed reverse inference, based on the Bayesian principle that high-quality generated speech should be usable as a prompt for subsequent generation with the same TTS model. By using reverse inference as the criterion for selecting exemplars for RLHF from speech samples generated by the TTS system itself, RIO steers the subsequent optimization toward improved TTS robustness. The RIO framework, comprising sampling, automatic annotating, and learning, obviates the need for a reward model or pairwise preference data, and significantly improves the stability of zero-shot TTS performance by reducing the discrepancies between training and inference conditions. Our experimental results verify that RIO effectively improves both subjective and objective metrics, including mean opinion scores, word error rates, and speaker similarity. Remarkably, RIO also reduces the incidence of bad outputs to nearly zero percent, rivalling the robustness achieved when ground-truth speech is used as the prompt.
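To make the sampling and automatic-annotating stages concrete, below is a minimal sketch of how reverse-inference-based exemplar selection could look in Python. The `model` interface (`generate`, `log_likelihood`) and all function names here are illustrative assumptions, not the paper's actual API; the sketch only captures the idea that candidates are ranked by how well they serve as prompts for regenerating the original prompt speech.

```python
# Hypothetical sketch of RIO's sampling and automatic-annotating stages.
# `model.generate` and `model.log_likelihood` are assumed interfaces,
# not names defined in the paper.

def reverse_inference_score(model, candidate_speech, prompt_text, prompt_speech):
    # Reverse inference: use the candidate as the prompt and ask the model
    # to regenerate the original prompt speech. A high-quality candidate
    # should make this reverse generation likely under the same model.
    return model.log_likelihood(
        target_speech=prompt_speech,
        target_text=prompt_text,
        prompt_speech=candidate_speech,
    )

def select_exemplars(model, text, prompt_speech, prompt_text, n_samples=8):
    # Sampling: draw several zero-shot generations for the same input.
    candidates = [model.generate(text, prompt_speech) for _ in range(n_samples)]
    # Automatic annotating: rank candidates by reverse-inference score,
    # with no reward model or human preference labels involved.
    ranked = sorted(
        candidates,
        key=lambda y: reverse_inference_score(model, y, prompt_text, prompt_speech),
        reverse=True,
    )
    # The learning stage can then favor the top-ranked sample (and optionally
    # penalize the bottom-ranked one) to steer the model toward robust outputs.
    return ranked[0], ranked[-1]
```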