Automatic speech recognition (ASR) is challenging for dysarthric speakers. Federated learning (FL)-based ASR can protect privacy effectively, but it suffers from heterogeneity caused by speaker variability. Forcing all speakers to share the same model components can be suboptimal under such heterogeneity, which makes personalization a promising direction; however, related research on dysarthric speech remains limited. To this end, this paper explores two aggregation strategies for personalization: parameter-based averaging and embedding-based averaging. Experiments on UASpeech and TORGO show that the proposed methods outperform the regularized FedAvg baseline, with statistically significant word error rate (WER) reductions of up to 0.99% absolute (3.15% relative) on UASpeech and 0.56% absolute (4.73% relative) on TORGO.
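To put the reported reductions in context, the following minimal sketch back-computes the baseline WER implied by each absolute/relative pair. It assumes that each pair refers to the same system comparison and that the relative reduction is measured against the baseline WER; the function name `implied_baseline_wer` is hypothetical, and the results are approximate because the published figures are rounded.

```python
def implied_baseline_wer(abs_reduction: float, rel_reduction: float) -> float:
    """Return the baseline WER (in %) implied by an absolute reduction
    (percentage points) and a relative reduction (%).

    Assumes rel_reduction = 100 * abs_reduction / baseline_wer.
    """
    return abs_reduction / (rel_reduction / 100.0)


# Figures quoted in the abstract: (absolute reduction, relative reduction).
reported = {"UASpeech": (0.99, 3.15), "TORGO": (0.56, 4.73)}

for name, (abs_red, rel_red) in reported.items():
    baseline = implied_baseline_wer(abs_red, rel_red)
    improved = baseline - abs_red
    print(f"{name}: baseline ~{baseline:.1f}% WER, improved ~{improved:.1f}% WER")
```

Under these assumptions, the implied baselines are roughly 31% WER on UASpeech and 12% WER on TORGO, which is why the smaller absolute gain on TORGO corresponds to the larger relative gain.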