Voice cloning for Text-to-Speech (TTS) aims to generate expressive, personalized speech from text using limited data from a target speaker. Federated Learning (FL) offers a collaborative, privacy-preserving framework for this task, but existing approaches incur high communication costs and tend to suppress stylistic heterogeneity, yielding insufficient personalization. To address these issues, we propose Fed-PISA (Federated Personalized Identity-Style Adaptation). To reduce communication costs, Fed-PISA introduces a disentangled Low-Rank Adaptation (LoRA) mechanism: the speaker's timbre is retained locally in a private ID-LoRA, while only a lightweight style-LoRA is transmitted to the server, minimizing parameter exchange. To harness heterogeneity, we introduce an aggregation method inspired by collaborative filtering that builds a custom model for each client by learning from stylistically similar peers. Experiments show that Fed-PISA improves style expressivity, naturalness, and speaker similarity, outperforming standard federated baselines at minimal communication cost.
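The collaborative-filtering-inspired aggregation can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the `personalized_aggregate` function, the flattened-vector representation of each client's style-LoRA update, and the softmax weighting over pairwise cosine similarities are all illustrative assumptions. The key idea it captures is that each client's personalized aggregate is dominated by stylistically similar peers rather than a single global average.

```python
import math

def personalized_aggregate(style_loras, temperature=1.0):
    """Similarity-weighted aggregation of flattened style-LoRA updates.

    style_loras: one update vector (list of floats) per client.
    Returns one personalized aggregate per client: a softmax-weighted
    average of all clients' updates, weighted by cosine similarity,
    so each client borrows mostly from stylistically similar peers.
    (Hypothetical sketch; the actual Fed-PISA aggregation may differ.)
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u)) or 1e-12
        nv = math.sqrt(sum(b * b for b in v)) or 1e-12
        return dot / (nu * nv)

    out = []
    for u in style_loras:
        # Softmax over pairwise similarities -> aggregation weights.
        sims = [cosine(u, v) / temperature for v in style_loras]
        m = max(sims)  # shift for numerical stability
        exps = [math.exp(s - m) for s in sims]
        z = sum(exps)
        w = [e / z for e in exps]
        # Weighted average of all clients' style-LoRA updates.
        out.append([sum(wj * v[d] for wj, v in zip(w, style_loras))
                    for d in range(len(u))])
    return out
```

With a low `temperature`, the weights concentrate on the most similar peers; with a high one, the scheme approaches uniform FedAvg-style averaging of the style parameters.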