This paper addresses the classification of Arabic text in Natural Language Processing (NLP), focusing on Natural Language Inference (NLI) and Contradiction Detection (CD). Arabic is considered a resource-poor language: few data sets are available, which in turn limits the NLP methods that can be developed for it. To overcome this limitation, we create a dedicated data set from publicly available resources. We then train and evaluate transformer-based machine learning models on this data set. We find that a language-specific model (AraBERT) performs competitively with state-of-the-art multilingual approaches when we apply linguistically informed pre-training methods such as Named Entity Recognition (NER). To our knowledge, this is the first large-scale evaluation for this task in Arabic, as well as the first application of multi-task pre-training in this context.