We enhance the vanilla adversarial training method for unsupervised Automatic Speech Recognition (ASR) with a diffusion-GAN. Our model (1) injects instance noise of various intensities into both the generator's output and the unlabeled reference text, the latter sampled from pretrained phoneme language models under a length constraint, (2) asks diffusion-timestep-dependent discriminators to separate them, and (3) back-propagates the gradients to update the generator. Word/phoneme error rate comparisons with wav2vec-U on the Librispeech (3.1% on test-clean and 5.6% on test-other), TIMIT, and MLS datasets show that these enhancement strategies are effective.
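Step (1) above can be illustrated with a minimal sketch of timestep-dependent noise injection. The abstract does not give the noise schedule or the exact forward process, so the code assumes a DDPM-style Gaussian forward process, y_t = sqrt(a_t) * y_0 + sqrt(1 - a_t) * sigma * eps, with a linear beta schedule. The names `alpha_bar_schedule`, `diffuse`, and `one_hot`, the toy data, and the choice of one shared timestep for the fake/real pair are hypothetical and used only for illustration.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for step in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * step / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def diffuse(frames, t, alpha_bars, rng, sigma=1.0):
    """Sample y_t = sqrt(a_t) * y_0 + sqrt(1 - a_t) * sigma * eps, element-wise."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t]) * sigma
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in frame]
            for frame in frames]


def one_hot(phoneme_ids, vocab_size):
    """Encode a phoneme-id sequence as one-hot frames (the 'real' reference text)."""
    return [[1.0 if k == i else 0.0 for k in range(vocab_size)]
            for i in phoneme_ids]


rng = random.Random(0)
alpha_bars = alpha_bar_schedule(num_steps=100)

# Hypothetical toy data: 3 frames over a 4-phoneme vocabulary.
generator_output = [[0.7, 0.1, 0.1, 0.1],
                    [0.2, 0.6, 0.1, 0.1],
                    [0.1, 0.1, 0.1, 0.7]]
reference_text = one_hot([0, 1, 3], vocab_size=4)

# Assumption: the fake and real samples in a pair share one timestep t,
# so a larger t means stronger noise on both.
t = rng.randrange(len(alpha_bars))
noisy_fake = diffuse(generator_output, t, alpha_bars, rng)
noisy_real = diffuse(reference_text, t, alpha_bars, rng)

# A timestep-dependent discriminator would score (noisy_fake, t) and
# (noisy_real, t); its gradients would then update the generator.
```

The same forward process is applied to both inputs, so the discriminator always compares samples at a matching noise level. The discriminator and the generator update in steps (2) and (3) are left out here because they need an autodiff framework.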