Code-switching, also called code-mixing, is the linguistic phenomenon in which multilingual speakers, in casual settings, mix words from different languages within a single utterance. Because it arises spontaneously, code-switched data is extremely low-resource, which makes it challenging for language and speech processing tasks. In such contexts, Code-Switching Language Identification (CSLID) becomes a difficult but necessary task if we want to make the most of existing monolingual tools for other tasks. In this work, we propose two novel approaches to improving language identification accuracy on an English-Mandarin child-directed speech dataset: a stacked Residual CNN+GRU model and a multitask pre-training approach that uses Automatic Speech Recognition (ASR) as an auxiliary task for CSLID. Because code-switched data is scarce, we also carefully create silver data from monolingual corpora in both languages and use up-sampling as data augmentation. We focus on English-Mandarin code-switched data, but our method can be applied to any language pair. Our best model achieves a balanced accuracy of 0.781 on a real English-Mandarin code-switching child-directed speech corpus and outperforms the previous baseline by 55.3%.
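The headline metric above is balanced accuracy. The abstract does not define it, so the sketch below assumes the standard definition: the unweighted mean of per-class recall. The labels `"zh"` and `"en"` and the toy predictions are hypothetical and are not drawn from the corpus; they only illustrate why the metric is informative when one language dominates.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced labels: 8 Mandarin segments, 2 English segments.
y_true = ["zh"] * 8 + ["en"] * 2
# A classifier biased toward the majority language misses one English segment.
y_pred = ["zh"] * 8 + ["zh", "en"]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = balanced_accuracy(y_true, y_pred)
print(f"accuracy={plain:.2f} balanced_accuracy={score:.2f}")
```

In this toy case plain accuracy rewards the majority-class bias, while balanced accuracy weights the recall of each language equally.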