Automatic recognition of disordered speech remains highly challenging because of data scarcity. This paper presents a reinforcement learning (RL) based on-the-fly data augmentation approach for training state-of-the-art PyChain TDNN and end-to-end Conformer ASR systems on such data. The handcrafted temporal and spectral mask operations of the standard SpecAugment method, which are task and system dependent, together with additionally introduced minimum and maximum cut-offs for these time-frequency masks, are now learned automatically using an RNN-based policy controller and are tightly integrated with ASR system training. Experiments on the UASpeech corpus suggest that the proposed RL-based data augmentation approach consistently produces performance superior or comparable to that obtained using expert or handcrafted SpecAugment policies. Our RL auto-augmented PyChain TDNN system produced an overall WER of 28.79% on the UASpeech test set of 16 dysarthric speakers.
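To make the augmentation concrete, the following is a minimal sketch of SpecAugment-style time and frequency masking with minimum and maximum mask-width cut-offs. It is not the authors' implementation: the function names, the `policy` dictionary and its keys, and the choice to zero out masked regions are all assumptions made for illustration. In the approach described above, such parameters would be selected by the RNN-based policy controller during ASR training; here they are simply fixed by hand.

```python
import numpy as np


def apply_mask(spec, axis, num_masks, min_width, max_width, rng):
    """Zero out `num_masks` random bands of `spec` along `axis`.

    Hypothetical helper: each band width is drawn uniformly between the
    minimum and maximum cut-offs (both inclusive).
    """
    out = spec.copy()
    size = out.shape[axis]
    for _ in range(num_masks):
        # Draw the width between the two cut-offs, capped at the axis length.
        width = min(int(rng.integers(min_width, max_width + 1)), size)
        # Pick a start position so that the band fits inside the axis.
        start = int(rng.integers(0, size - width + 1))
        index = [slice(None)] * out.ndim
        index[axis] = slice(start, start + width)
        out[tuple(index)] = 0.0
    return out


def spec_augment(spec, policy, seed=None):
    """Apply time masks, then frequency masks, to a (frames, bins) array.

    `policy` is an assumed dictionary of hand-set parameters standing in
    for the values a learned controller would provide.
    """
    rng = np.random.default_rng(seed)
    out = apply_mask(spec, 0, policy["time_masks"],
                     policy["time_min"], policy["time_max"], rng)
    out = apply_mask(out, 1, policy["freq_masks"],
                     policy["freq_min"], policy["freq_max"], rng)
    return out


# Example with illustrative values only (100 frames, 40 filterbank bins).
example_policy = {
    "time_masks": 1, "time_min": 5, "time_max": 10,
    "freq_masks": 1, "freq_min": 2, "freq_max": 4,
}
features = np.ones((100, 40))
augmented = spec_augment(features, example_policy, seed=0)
```

Because the masks are sampled each time the function is called, the same utterance receives a different augmentation on every pass, which is what "on-the-fly" refers to in this sketch.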