Data sparsity is one of the main challenges posed by code-switching (CS), and it is further exacerbated in morphologically rich languages. For the task of machine translation (MT), morphological segmentation has proven successful in alleviating data sparsity in monolingual contexts; however, it has not been investigated for CS settings. In this paper, we study the effect of different segmentation approaches on MT performance, covering morphology-based and frequency-based segmentation techniques. We experiment on MT from code-switched Egyptian Arabic-English to English. We provide a detailed analysis, examining a variety of conditions, such as data size and sentences with different degrees of CS. Empirical results show that morphology-aware segmenters perform best in segmentation tasks but underperform in MT. Nevertheless, we find that the choice of segmentation setup for MT depends heavily on data size. For extreme low-resource scenarios, a combination of frequency-based and morphology-based segmentation performs best. For more resourced settings, such a combination does not bring significant improvements over frequency-based segmentation alone.
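To make the notion of frequency-based segmentation concrete, the sketch below implements a toy byte-pair-encoding (BPE) style learner. The abstract does not name the specific segmenters used, so BPE is chosen here only as one widely used frequency-based technique; the function names (`learn_bpe`, `merge_pair`, `segment`) and the toy corpus are hypothetical and are not taken from the paper.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of word tokens."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by applying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (an assumption for illustration only).
corpus = ["low", "low", "lower", "lowest", "newer", "newest"]
merges = learn_bpe(corpus, num_merges=5)
print(segment("lowest", merges))
```

Unlike a morphology-based segmenter, this procedure has no notion of stems or affixes: it merges whatever symbol pairs are most frequent in the training data, which is why the resulting units depend directly on the corpus and the number of merges.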