Code-switching (CS) refers to the phenomenon of alternating between words and phrases from different languages. Although today's neural end-to-end (E2E) models deliver state-of-the-art performance on automatic speech recognition (ASR), these systems are known to be very data-intensive, and only a small amount of transcribed and aligned CS speech is available. To overcome this problem and train multilingual systems that can transcribe CS speech, we propose a simple yet effective data augmentation in which the audio and corresponding labels of different source languages are concatenated. Trained on this data, our E2E model improves at transcribing CS speech. It also surpasses monolingual models on monolingual tests. The results show that this augmentation technique can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
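The concatenation idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `Utterance` container, the function names `concat_utterances` and `augment`, the random pairing strategy, and the random ordering of the two languages are all assumptions made for illustration, and audio is represented as a plain list of samples assumed to share one sample rate.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Utterance:
    """Hypothetical container for one monolingual training example."""
    samples: List[float]  # mono waveform samples (shared sample rate assumed)
    transcript: str       # reference label for the audio
    language: str         # language tag, used here only for bookkeeping


def concat_utterances(first: Utterance, second: Utterance) -> Utterance:
    """Join two utterances so that audio and labels stay aligned in order."""
    return Utterance(
        samples=first.samples + second.samples,
        transcript=f"{first.transcript} {second.transcript}",
        language=f"{first.language}+{second.language}",
    )


def augment(
    pool_a: Sequence[Utterance],
    pool_b: Sequence[Utterance],
    n_pairs: int,
    seed: int = 0,
) -> List[Utterance]:
    """Build n_pairs artificial inter-sentential CS examples.

    Assumption: pairs are drawn uniformly at random and the language order
    is chosen by a fair coin flip; the source does not specify either choice.
    """
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_pairs):
        a = rng.choice(pool_a)
        b = rng.choice(pool_b)
        pair = (a, b) if rng.random() < 0.5 else (b, a)
        augmented.append(concat_utterances(*pair))
    return augmented
```

Each generated example contains exactly one language switch at the sentence boundary, which is why this kind of data targets inter-sentential rather than intra-sentential switching.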