In this paper, we train End-to-End Automatic Speech Recognition (ASR) models using Federated Learning (FL) and examine the fundamental considerations that can be pivotal in minimizing the word error rate gap between FL-trained models and their centrally trained counterparts. Specifically, we study the effect of (i) adaptive optimizers, (ii) loss characteristics, by altering the Connectionist Temporal Classification (CTC) weight, (iii) model initialization through seed start, (iv) carrying over modeling choices from centralized training to FL, e.g., pre-layer or post-layer normalization, and (v) FL-specific hyperparameters, such as the number of local epochs, the client sampling size, and the learning rate scheduler, specifically for ASR under heterogeneous data distribution. We shed light on how some optimizers work better than others by inducing smoothness. We also summarize the applicability of algorithms and trends from prior work on FL in general to End-to-End ASR models, and propose best practices.
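To make the role of an adaptive optimizer in FL concrete, the following is a minimal sketch of one common formulation, in which the server applies an Adam-style update to the averaged client deltas. It is not taken from this paper: the toy least-squares objective, the function names (`local_update`, `fedadam_round`, `global_loss`), and all hyperparameter values are assumptions chosen purely for illustration, and a real ASR setup would replace the local step with training of the End-to-End model on each client's speech data.

```python
import numpy as np

def local_update(weights, data, lr=0.1, epochs=1):
    """Hypothetical client step: gradient descent on a toy least-squares loss."""
    x, y = data
    w = weights.copy()
    for _ in range(epochs):
        grad = 2.0 * x.T @ (x @ w - y) / len(y)
        w -= lr * grad
    return w

def fedadam_round(weights, clients, state, server_lr=0.05,
                  beta1=0.9, beta2=0.99, tau=1e-3):
    """One FL round: average client deltas, then take an Adam-style server step."""
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    deltas = [local_update(weights, c) - weights for c in clients]
    # The size-weighted mean of client deltas acts as a pseudo-gradient.
    delta = sum(s * d for s, d in zip(sizes, deltas)) / sizes.sum()
    state["m"] = beta1 * state["m"] + (1 - beta1) * delta
    state["v"] = beta2 * state["v"] + (1 - beta2) * delta ** 2
    return weights + server_lr * state["m"] / (np.sqrt(state["v"]) + tau)

def global_loss(weights, clients):
    """Size-weighted mean squared error over all clients."""
    total = sum(len(y) for _, y in clients)
    return sum(np.sum((x @ weights - y) ** 2) for x, y in clients) / total

# Two clients with different input distributions (a stand-in for heterogeneity).
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
x_a = rng.normal(0.0, 1.0, size=(40, 2))
x_b = rng.normal(1.0, 0.5, size=(60, 2))
clients = [(x_a, x_a @ true_w), (x_b, x_b @ true_w)]

w = np.zeros(2)
state = {"m": np.zeros(2), "v": np.zeros(2)}
initial_loss = global_loss(w, clients)
for _ in range(100):
    w = fedadam_round(w, clients, state)
final_loss = global_loss(w, clients)
```

Replacing the last line of `fedadam_round` with `weights + delta` recovers plain federated averaging, which makes it easy to compare the two server updates on the same toy clients.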