Text-to-speech (TTS) synthesis has achieved near-human quality for neutral speech, but emotional expressiveness remains a challenge. Existing methods often rely on costly emotion annotations or optimize indirect objectives that fail to capture the emotional expressiveness and perceptual naturalness of speech, yielding speech that is accurate but emotionally flat. To address these challenges, we propose the RLAIF-SPA framework, which incorporates a Reinforcement Learning from AI Feedback (RLAIF) mechanism: Automatic Speech Recognition (ASR) judges semantic accuracy, a Large Language Model (LLM) judges prosodic-emotional label alignment, and the two judgments serve as direct rewards for optimizing intelligibility and emotional expressiveness. Specifically, the framework leverages Prosodic Label Alignment to enhance expressive quality, jointly considering semantic accuracy and prosodic-emotional alignment along four fine-grained dimensions: Structure, Emotion, Speed, and Tone. In addition, it incorporates Semantic Accuracy Feedback to ensure the generation of clear and accurate speech. Experiments on the LibriSpeech dataset show that RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and over 10% improvement in human evaluation.
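To make the reward design concrete, the sketch below shows one way the two feedback signals could be combined into a scalar reward. This is a minimal illustration, not the paper's implementation: the callables asr_transcribe and llm_judge_alignment, the equal weighting of the four prosodic dimensions, and the semantic_weight parameter are all assumptions introduced here for clarity.

```python
# Minimal sketch of a composite RLAIF reward, assuming an ASR transcriber
# and an LLM judge are available as callables. All names and the weighting
# scheme are hypothetical placeholders, not the paper's actual design.
from typing import Callable, Dict


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)
    return d[-1][-1] / max(len(ref), 1)


def rlaif_reward(
    text: str,
    audio: bytes,
    asr_transcribe: Callable[[bytes], str],  # hypothetical ASR wrapper
    llm_judge_alignment: Callable[[bytes, str], Dict[str, float]],  # scores in [0, 1]
    semantic_weight: float = 0.5,  # assumed trade-off, not from the paper
) -> float:
    """Combine ASR-based semantic accuracy with LLM-judged prosodic-emotional
    alignment over the four fine-grained dimensions named in the abstract."""
    # Semantic Accuracy Feedback: 1 - WER between input text and ASR transcript.
    semantic = max(0.0, 1.0 - wer(text, asr_transcribe(audio)))
    # Prosodic Label Alignment: mean of the four fine-grained LLM scores.
    scores = llm_judge_alignment(audio, text)
    prosody = sum(scores[k] for k in ("structure", "emotion", "speed", "tone")) / 4
    return semantic_weight * semantic + (1 - semantic_weight) * prosody
```

Under these assumptions, the scalar returned by rlaif_reward would be fed to the reinforcement-learning optimizer as the per-utterance reward, so that intelligibility and expressiveness are optimized jointly rather than through indirect proxy objectives.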