Speaker adaptation, which aims to clone the voices of unseen speakers in text-to-speech (TTS), has attracted significant interest due to its many applications in multimedia. Despite recent advances, existing methods often suffer from inadequate speaker representation accuracy and overfitting, particularly in scenarios with limited reference speech. To address these challenges, we propose an Agile Speaker Representation Reinforcement Learning (ASRRL) strategy to enhance speaker similarity in speaker adaptation tasks. ASRRL is the first work to apply reinforcement learning (RL) to improve the modeling accuracy of speaker embeddings in speaker adaptation, addressing the challenge of decoupling voice content and timbre. Our approach introduces two action strategies tailored to different reference-speech scenarios. In the single-sentence scenario, a knowledge-oriented optimal-routine-searching RL method is employed to expedite the exploration and retrieval of refinement information on the fringe of speaker representations. In the few-sentence scenario, a dynamic RL method adaptively fuses the reference speeches, enhancing the robustness and accuracy of speaker modeling. To achieve optimal results in the target domain, we propose a reward model based on a multi-scale fusion scoring mechanism that evaluates speaker similarity, speech quality, and intelligibility along three dimensions, ensuring that improvements in speaker similarity do not come at the cost of speech quality or intelligibility. Experimental results on the LibriTTS and VCTK datasets within mainstream TTS frameworks demonstrate the extensibility and generalization of the proposed ASRRL method. The results indicate that ASRRL significantly outperforms traditional fine-tuning approaches, achieving higher speaker similarity and better overall speech quality with limited reference speech.