Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech

Aligning text-to-speech (TTS) outputs with human preferences through human feedback has proven effective for improving the robustness of language model-based TTS systems. Current approaches rely mainly on preference data annotated at the utterance level. However, the issues that most often degrade the listening experience tend to arise only in specific segments of an audio sample, while the remaining segments are well generated. In this study, we propose fine-grained preference optimization (FPO) to enhance the robustness of TTS systems. Rather than optimizing the entire utterance uniformly, FPO targets localized issues in generated samples. Specifically, we first analyze the types of issues in generated samples, categorize them into two groups, and propose a selective training loss strategy that optimizes preferences based on fine-grained labels for each issue type. Experimental results show that FPO enhances the robustness of zero-shot TTS systems by effectively addressing local issues, significantly reducing the bad-case ratio and improving intelligibility. Furthermore, FPO is more data-efficient than baseline systems, achieving similar performance with fewer training samples.
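The abstract does not state the form of the selective training loss, so the following is only a minimal sketch of the general idea, assuming a DPO-style preference objective in which token-level log-probability ratios are summed only over tokens flagged by a fine-grained label. The function names, the `(policy_logps, ref_logps, mask)` tuple layout, and the `beta` value are all hypothetical and are not taken from the paper.

```python
import math


def masked_log_ratio(policy_logps, ref_logps, mask):
    """Sum (policy - reference) token log-probs over tokens where mask == 1.

    The mask stands in for a fine-grained label marking the problematic
    segment; how such labels are produced is not specified here (assumption).
    """
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))


def selective_preference_loss(chosen, rejected, beta=0.1):
    """DPO-style loss restricted to flagged tokens (illustrative assumption).

    `chosen` and `rejected` are (policy_logps, ref_logps, mask) tuples.
    Tokens with mask == 0 contribute nothing to the loss.
    """
    margin = beta * (masked_log_ratio(*chosen) - masked_log_ratio(*rejected))
    # -log(sigmoid(margin)), computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy example: only the middle token of each sample is flagged.
chosen = ([-1.0, -0.5, -0.2], [-1.0, -1.5, -0.2], [0, 1, 0])
rejected = ([-1.0, -2.0, -0.3], [-1.0, -1.0, -0.3], [0, 1, 0])
loss = selective_preference_loss(chosen, rejected)
```

In this sketch, changing the log-probabilities of unflagged tokens leaves the loss unchanged, which mirrors the abstract's point that well-generated segments need not be optimized uniformly with the problematic ones.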