Self-supervised learning (SSL) methods for large speech models have proven highly effective at ASR. With growing interest in the public deployment of large pre-trained models, there is rising concern about unintended memorization and leakage of sensitive data points from the training data. In this paper, we apply differentially private (DP) pre-training to a SOTA Conformer-based encoder and study its performance on a downstream ASR task, assuming the fine-tuning data is public. This paper is the first to apply DP to SSL for ASR, investigating the DP noise tolerance of the BEST-RQ pre-training method. Notably, we introduce a novel variant of model pruning called gradient-based layer freezing that provides strong improvements in privacy-utility-compute trade-offs. Our approach yields a LibriSpeech test-clean/other WER (%) of 3.78/8.41 with $(10, 10^{-9})$-DP when extrapolating towards low dataset scales, and 2.81/5.89 with $(10, 7.9\times 10^{-11})$-DP when extrapolating towards high scales.
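The abstract names two technical ingredients without detail: DP pre-training (per-example gradient clipping plus Gaussian noise, as in standard DP-SGD) and gradient-based layer freezing. Below is a minimal NumPy sketch of both ideas under those standard assumptions; the function names, the `keep_fraction` parameter, and the freezing rule (keep only the layers with the largest gradient norms) are illustrative placeholders, not the paper's actual BEST-RQ/Conformer implementation.

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One DP-SGD-style update: clip each per-example gradient to clip_norm,
    average, then add Gaussian noise calibrated to the clipping bound."""
    rng = rng or np.random.default_rng(0)
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    mean_grad = np.mean(clipped, axis=0)
    # Noise std for the *averaged* gradient: sigma * C / batch_size.
    std = noise_multiplier * clip_norm / len(per_example_grads)
    return mean_grad + rng.normal(0.0, std, size=mean_grad.shape)

def freeze_low_gradient_layers(layer_grad_norms, keep_fraction=0.5):
    """Hypothetical gradient-based layer freezing: keep training only the
    layers with the largest gradient norms; freeze the rest. Returns a
    boolean trainability mask, one entry per layer."""
    order = np.argsort(layer_grad_norms)[::-1]          # descending by norm
    n_keep = max(1, int(len(order) * keep_fraction))
    trainable = set(order[:n_keep].tolist())
    return [i in trainable for i in range(len(layer_grad_norms))]

# Toy usage: 8 per-example gradients for a 4-parameter "layer".
rng = np.random.default_rng(42)
grads = [rng.normal(size=4) for _ in range(8)]
update = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
mask = freeze_low_gradient_layers(np.array([0.2, 1.5, 0.7, 0.1]))
print(update, mask)  # noisy averaged gradient, [False, True, True, False]
```

Freezing layers with small gradient norms reduces both the number of noised parameters and the compute per step, which is one plausible reading of how such a scheme could improve the privacy-utility-compute trade-off the abstract reports.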