Federated learning (FL) has recently emerged as a promising approach to training machine learning models, but its application to automatic speech recognition (ASR) has seen only preliminary exploration. Moreover, FL does not inherently guarantee user privacy; robust privacy guarantees require differential privacy (DP). We are not aware of prior work applying DP to FL for ASR. In this paper, we aim to bridge this research gap by formulating an ASR benchmark for FL with DP and establishing the first baselines. First, we extend the existing research on FL for ASR by exploring different aspects of recent $\textit{large end-to-end transformer models}$: architecture design, seed models, data heterogeneity, domain shift, and the impact of cohort size. With a $\textit{practical}$ number of central aggregations, we are able to train $\textbf{FL models}$ that are $\textbf{nearly optimal}$ even with heterogeneous data, a seed model from another domain, or no pre-trained seed model. Second, we apply DP to FL for ASR, which is non-trivial because DP noise severely affects model training, especially for large transformer models, due to highly imbalanced gradients in the attention block. We counteract the adverse effect of DP noise by reviving per-layer clipping, and we explain why its effect is more apparent in our case than in prior work. Remarkably, we achieve user-level ($7.2$, $10^{-9}$)-$\textbf{DP}$ (resp. ($4.5$, $10^{-9}$)-$\textbf{DP}$) with a 1.3% (resp. 4.6%) absolute drop in word error rate when extrapolating to a high (resp. low) population scale for $\textbf{FL with DP in ASR}$.
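To make the per-layer clipping idea concrete, here is a minimal sketch of one central aggregation step with user-level DP. It is an illustration under stated assumptions, not the paper's implementation: the function names `clip_per_layer` and `dp_aggregate`, the layer names, and the choice to split the total clipping bound uniformly across layers (each layer gets `total_clip / sqrt(num_layers)`) are all hypothetical, since the abstract does not specify the exact rule.

```python
import numpy as np


def clip_per_layer(update, total_clip):
    """Clip each layer's delta separately.

    Assumption: the total L2 bound is split uniformly, so each of the
    num_layers layers is clipped to total_clip / sqrt(num_layers). This keeps
    the L2 norm of the whole update at most total_clip.
    """
    per_layer_bound = total_clip / np.sqrt(len(update))
    clipped = {}
    for name, delta in update.items():
        norm = np.linalg.norm(delta)
        # Shrink only layers whose norm exceeds the per-layer bound.
        scale = min(1.0, per_layer_bound / (norm + 1e-12))
        clipped[name] = delta * scale
    return clipped


def dp_aggregate(user_updates, total_clip, noise_multiplier, rng):
    """Sum clipped user updates, add Gaussian noise, and average."""
    clipped = [clip_per_layer(u, total_clip) for u in user_updates]
    aggregate = {}
    for name in clipped[0]:
        summed = sum(c[name] for c in clipped)
        # Noise standard deviation is tied to the per-user sensitivity.
        noise = rng.normal(0.0, noise_multiplier * total_clip, size=summed.shape)
        aggregate[name] = (summed + noise) / len(clipped)
    return aggregate


# Hypothetical cohort of one user whose attention-layer delta dwarfs the rest.
updates = [{"attn": np.full(4, 10.0), "ffn": np.full(4, 0.01)}]
clipped = clip_per_layer(updates[0], total_clip=1.0)
noiseless = dp_aggregate(updates, 1.0, 0.0, np.random.default_rng(0))
```

In this toy example, only the large `attn` delta is scaled down, while the small `ffn` delta passes through unchanged. Under a single global clip, the same scale factor would be applied to every layer, so a layer with a very large norm would shrink the small layers' signal along with its own, which is one plausible reading of why imbalanced gradients make DP noise more damaging.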