Data sparsity is a major obstacle to developing code-switching (CS) NLP systems. In this paper, we investigate data augmentation techniques for synthesizing dialectal Arabic-English CS text. We perform lexical replacements using word-aligned parallel corpora, where CS points are either chosen randomly or learnt by a sequence-to-sequence model, and we compare these approaches against dictionary-based replacements. We assess the quality of the generated sentences through human evaluation and evaluate the effectiveness of data augmentation on machine translation (MT), automatic speech recognition (ASR), and speech translation (ST) tasks. Human judgements show that the predictive model produces more natural CS sentences than the random approach. In the downstream tasks, although the random approach generates more data, both approaches perform equally well, and both outperform dictionary-based replacements. Overall, data augmentation achieves a 34% improvement in perplexity, a 5.2% relative improvement in WER on the ASR task, +4.0-5.1 BLEU points on the MT task, and +2.1-2.2 BLEU points on the ST task over a baseline trained on the available data without augmentation.
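As an illustration of the random variant of lexical replacement described above, the following is a minimal sketch and not the implementation used in this work. The function name `synthesize_cs`, the `replace_prob` parameter, the restriction to one-to-one alignment links, and the transliterated example tokens are all assumptions made for the example.

```python
import random
from collections import Counter


def synthesize_cs(src_tokens, tgt_tokens, alignment, replace_prob=0.3, rng=None):
    """Hypothetical helper: replace randomly chosen source words with their
    aligned target words to produce a synthetic code-switched sentence.

    alignment is a list of (source_index, target_index) pairs, as produced
    by a word aligner. Only one-to-one links are used (an assumption made
    here to keep the sketch simple).
    """
    rng = rng or random.Random()
    src_counts = Counter(s for s, _ in alignment)
    tgt_counts = Counter(t for _, t in alignment)
    one_to_one = {
        s: t for s, t in alignment
        if src_counts[s] == 1 and tgt_counts[t] == 1
    }

    output = []
    for i, token in enumerate(src_tokens):
        # Random choice of CS points; a learnt approach would replace this
        # coin flip with the decision of a predictive model.
        if i in one_to_one and rng.random() < replace_prob:
            output.append(tgt_tokens[one_to_one[i]])
        else:
            output.append(token)
    return output


# Illustrative, transliterated tokens (not taken from the corpus).
src = ["ana", "rayeh", "el-gam3a"]
tgt = ["I", "am", "going", "to", "university"]
links = [(0, 0), (1, 2), (2, 4)]
print(synthesize_cs(src, tgt, links, replace_prob=0.5, rng=random.Random(0)))
```

Setting `replace_prob` to 0 leaves the sentence unchanged, while 1 replaces every word that has a one-to-one link; intermediate values control how much of each sentence is switched.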