One limitation of the end-to-end automatic speech recognition (ASR) framework is that its performance degrades when training and test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) data augmentation method to alleviate the train-test utterance length mismatch in the short-video ASR task. Specifically, we are motivated by the observation that our human-transcribed training utterances for short-video spontaneous speech tend to be short (~3 seconds on average), while our test utterances, generated by a voice activity detection front-end, are much longer (~10 seconds on average). Such a mismatch can lead to suboptimal performance. Empirically, the proposed RUC method significantly improves long-utterance recognition without degrading performance on short utterances. Overall, it achieves a 5.72% word error rate reduction on average across 15 languages and improves robustness to varying utterance lengths.
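To make the idea of on-the-fly random utterance concatenation concrete, the following is a minimal sketch and not the implementation used in the paper. The abstract does not specify how many utterances are joined, how they are sampled, or whether a length cap is applied, so the `Utterance` type, the function `random_utterance_concat`, and the parameters `max_concat`, `max_samples`, and `concat_prob` are all hypothetical. Joining transcripts with a space is likewise an assumption that suits whitespace-delimited languages only.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Utterance:
    samples: List[float]  # raw waveform samples
    text: str             # reference transcript


def random_utterance_concat(
    pool: Sequence[Utterance],
    rng: random.Random,
    max_concat: int = 3,          # hypothetical: max utterances joined into one example
    max_samples: int = 160_000,   # hypothetical cap: 10 s at an assumed 16 kHz rate
    concat_prob: float = 0.5,     # hypothetical: chance of augmenting a given example
) -> Utterance:
    """Draw one training example, optionally lengthened by appending random utterances."""
    first = rng.choice(pool)
    samples = list(first.samples)
    texts = [first.text]

    # Keep some examples untouched so short utterances remain in the training mix.
    if max_concat > 1 and rng.random() < concat_prob:
        n_extra = rng.randint(1, max_concat - 1)
        for _ in range(n_extra):
            extra = rng.choice(pool)
            # Stop once adding another utterance would exceed the length cap.
            if len(samples) + len(extra.samples) > max_samples:
                break
            samples.extend(extra.samples)
            texts.append(extra.text)

    # Assumption: transcripts are joined with a single space.
    return Utterance(samples=samples, text=" ".join(texts))
```

Calling this function once per training step, rather than building a concatenated corpus in advance, is what makes the augmentation on-the-fly in this sketch: each epoch sees a different mix of short and long examples drawn from the same ~3-second utterances.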