Differentiable reinforcement learning (RL) frameworks such as DiffRO offer a powerful approach to controllable text-to-speech (TTS), but they are vulnerable to reward hacking, particularly on nuanced tasks such as emotion control. The policy model can exploit a vanilla Reward Model (RM) by generating acoustic artifacts that earn spurious rewards at the cost of degraded perceptual quality. To address this, we propose Robust Reward Policy Optimization (RRPO), a novel framework that employs a hybrid regularization scheme. This scheme yields a robust RM whose reward signal is more reliably aligned with human perception, compelling the policy to abandon detrimental shortcuts and instead learn the complex features of genuine emotion. Our ablation study confirms the enhanced robustness of our RM, as evidenced by its strong cross-lingual generalization. Subjective evaluation demonstrates that this robust RM effectively mitigates reward hacking, yielding significant improvements in both emotional expressiveness and naturalness over all baselines. Demo page: https://lrwinr.github.io/RRPO-CosyVoice.