Data sparsity is one of the main challenges posed by code-switching (CS), and it is further exacerbated in morphologically rich languages. For the task of machine translation (MT), morphological segmentation has proven successful in alleviating data sparsity in monolingual contexts; however, it has not been investigated for CS settings. In this paper, we study the effect of different segmentation approaches on MT performance, covering morphology-based and frequency-based segmentation techniques. We experiment on MT from code-switched Arabic-English to English. We provide a detailed analysis across a variety of conditions, such as data size and sentences with different degrees of CS. Empirical results show that morphology-aware segmenters perform best in segmentation tasks but underperform in MT. Nevertheless, we find that the best choice of segmentation setup for MT depends heavily on data size. In extremely low-resource scenarios, a combination of frequency-based and morphology-based segmentation performs best. In better-resourced settings, this combination does not bring significant improvements over frequency-based segmentation alone.