To protect privacy and comply with legal regulations, federated learning (FL) has gained significant attention for training speech-to-text (S2T) systems, including automatic speech recognition (ASR) and speech translation (ST). However, the FL approach commonly used for S2T tasks (i.e., \textsc{FedAvg}) typically suffers from two problems: extensive communication overhead, because the whole model is exchanged over multiple rounds, and performance degradation caused by data heterogeneity among clients. To address these issues, we propose a personalized federated S2T framework with two components. \textsc{FedLoRA} is a lightweight LoRA module used for client-side tuning and for interaction with the server, which minimizes communication overhead. \textsc{FedMem} is a global model equipped with a $k$-nearest-neighbor ($k$NN) classifier that captures client-specific distributional shifts to achieve personalization and overcome data heterogeneity. Extensive experiments with Conformer and Whisper backbone models on the CoVoST and GigaSpeech benchmarks show that our approach significantly reduces communication overhead on all S2T tasks and effectively personalizes the global model to overcome data heterogeneity.
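To illustrate why exchanging only LoRA parameters reduces communication, the following is a minimal sketch rather than the implementation evaluated in this work. It assumes the server aggregates each client's low-rank factors with a sample-size-weighted average in the style of \textsc{FedAvg}. The function names, the dictionary layout of a client update, and the aggregation rule itself are hypothetical. Averaging the factors separately is a simplification and does not, in general, equal averaging the products of the factors.

```python
from typing import Dict, List

Matrix = List[List[float]]


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in a LoRA update B (d_out x rank) times A (rank x d_in)."""
    return rank * (d_out + d_in)


def weighted_average(mats: List[Matrix], weights: List[float]) -> Matrix:
    """Element-wise weighted mean of equally shaped matrices."""
    total = sum(weights)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [
        [sum(w * m[i][j] for m, w in zip(mats, weights)) / total
         for j in range(cols)]
        for i in range(rows)
    ]


def aggregate_lora(client_updates: List[Dict[str, Matrix]],
                   num_samples: List[int]) -> Dict[str, Matrix]:
    """Assumed FedAvg-style aggregation applied to LoRA factors only.

    Each client update maps a (hypothetical) parameter name to a matrix.
    The frozen backbone weights are never transmitted.
    """
    weights = [float(n) for n in num_samples]
    return {
        name: weighted_average([u[name] for u in client_updates], weights)
        for name in client_updates[0]
    }


# Payload for one 512 x 512 projection: full weight vs. rank-8 LoRA factors.
full_params = 512 * 512
lora_params = lora_param_count(512, 512, rank=8)
reduction = full_params // lora_params
```

Under these assumptions, a single 512 x 512 projection costs 262,144 values per round when sent in full and 8,192 values when only rank-8 factors are sent. The layer size and rank here are chosen for illustration and are not taken from the experiments.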