Low-footprint automatic speech recognition (ASR) models are increasingly deployed on edge devices for conversational agents, which enhances privacy. We study federated continual incremental learning for recurrent neural network-transducer (RNN-T) ASR models in the privacy-enhancing setting of on-device learning, without access to ground-truth human transcripts or to machine transcriptions from a stronger ASR model. In particular, we study the performance of a self-learning scheme with a paired teacher model that is updated through an exponential moving average of ASR models. Further, we propose using possibly noisy weak-supervision signals, such as feedback scores and natural language understanding semantics determined from user behavior across multiple turns in a session of interactions with the conversational agent. These signals are leveraged in a multi-task policy-gradient training approach to improve the performance of self-learning for ASR. Finally, we show how catastrophic forgetting can be mitigated by combining on-device learning with a memory-replay approach that uses selected historical datasets. In the absence of strong-supervision signals such as ground-truth transcriptions, these innovations yield a 10% relative improvement in word error rate (WER) on new use cases with minimal degradation on other test sets.
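The exponential-moving-average teacher update mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the decay value, the update schedule, or how parameters are stored, so the function name `ema_update`, the dictionary-of-lists parameter layout, and the default `decay` below are all illustrative assumptions rather than the authors' implementation.

```python
from typing import Dict, List

# Hypothetical parameter layout: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def ema_update(teacher: Params, student: Params, decay: float = 0.999) -> Params:
    """Return new teacher weights: decay * teacher + (1 - decay) * student.

    `decay` is an assumed hyperparameter; the abstract does not specify it.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    if teacher.keys() != student.keys():
        raise ValueError("teacher and student must share parameter names")

    updated: Params = {}
    for name, teacher_vals in teacher.items():
        student_vals = student[name]
        if len(teacher_vals) != len(student_vals):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise blend of the old teacher with the current student.
        updated[name] = [
            decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_vals, student_vals)
        ]
    return updated
```

With a decay close to 1, the teacher changes slowly and so smooths over individual noisy student updates, which is the usual motivation for pairing a self-learning student with an averaged teacher.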