While English-language resources for understanding social media content are largely sufficient, comparable resources for Arabic remain immature. The main reason is that Arabic has many dialects in addition to its standard form, Modern Standard Arabic (MSA). Arabs do not use MSA in daily communication; they use dialects instead. Social media users carry this habit over to social media platforms, which has created an urgent need for suitable AI models for language-dependent applications. Existing machine translation (MT) systems designed for MSA perform poorly on Arabic dialects. It is therefore necessary to adapt to the informal nature of communication on social networks by developing MT systems that can effectively handle the various Arabic dialects. Whereas MT for MSA has progressed considerably, little effort has been devoted to MT for Arabic dialects. Although a few attempts have been made to build translation datasets for dialectal Arabic, these datasets are domain dependent and do not reflect the culture and language of online social networks (OSNs). In this work, we attempt to alleviate these limitations by proposing an OSN-based multidialect Arabic dataset, built by contextually translating English tweets into four Arabic dialects: Gulf, Yemeni, Iraqi, and Levantine. To perform the translation, we followed our proposed guideline framework for content translation, which could be applied more generally to translation between foreign languages and local dialects. We validated the authenticity of our dataset by developing neural MT (NMT) models for the four Arabic dialects. Our results show superior performance for the NMT models trained on our dataset. We believe that our dataset can reliably serve as an Arabic multidialectal translation dataset for informal MT tasks.