In this paper, we create weak alignment supervision to aid end-to-end modeling. To this end, we use an existing hybrid ASR system to produce triphone alignments of the training audio. We then apply a cross-entropy loss at a chosen encoder layer, using the derived alignments as targets. In contrast to the usual one-hot cross-entropy loss, with or without loss weighting, we use a cross-entropy loss with a label smoothing parameter to regularize the supervision. For comparison, we also run experiments with one-hot cross-entropy losses and CTC losses, both with loss weighting. The results show that placing the weak alignment supervision at the third encoder layer with a label smoothing parameter of 0.5 outperforms the other two approaches and yields about a 5% relative WER reduction over the baseline on the TED-LIUM 2 dataset. We see similar improvements when applying the method out of the box to a Tagalog end-to-end ASR system.
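To make the label-smoothed alignment loss concrete, the following is a minimal sketch rather than the implementation used in the paper. It assumes the common smoothing convention in which the target distribution is (1 - smoothing) times the one-hot label plus smoothing spread uniformly over all K classes; the abstract does not state which convention is used. The function names, the toy logits, and the class count are hypothetical, and how this loss is combined with the main end-to-end objective is not shown.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one frame's class logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def smoothed_cross_entropy(logits, target, smoothing):
    """Cross-entropy against (1 - smoothing) * one_hot + smoothing / K.

    ASSUMPTION: uniform smoothing over all K classes, including the target.
    """
    log_probs = log_softmax(logits)
    one_hot_term = -log_probs[target]
    uniform_term = -sum(log_probs) / len(log_probs)
    return (1.0 - smoothing) * one_hot_term + smoothing * uniform_term


def alignment_loss(frame_logits, alignment, smoothing=0.5):
    """Mean per-frame loss for one utterance.

    frame_logits: per-frame logits projected from an intermediate encoder
                  layer (hypothetical projection, not shown here).
    alignment:    per-frame class indices, e.g. triphone-state ids taken
                  from a hybrid system's forced alignment.
    """
    if len(frame_logits) != len(alignment):
        raise ValueError("need exactly one alignment label per frame")
    losses = [
        smoothed_cross_entropy(logits, label, smoothing)
        for logits, label in zip(frame_logits, alignment)
    ]
    return sum(losses) / len(losses)


# Toy utterance: 3 frames, 4 classes (real triphone inventories are far larger).
toy_logits = [[4.0, 0.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
toy_alignment = [0, 1, 2]
weak_loss = alignment_loss(toy_logits, toy_alignment, smoothing=0.5)
hard_loss = alignment_loss(toy_logits, toy_alignment, smoothing=0.0)
```

With `smoothing=0.0` the function reduces to the one-hot cross-entropy. Under this convention, a larger smoothing value penalizes the model less for placing probability mass on classes other than the aligned one, which is one way to read the supervision as "weak".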