This paper proposes a sequence-to-sequence learning approach to Arabic pronoun resolution and examines how effectively advanced natural language processing (NLP) techniques, specifically Bi-LSTM and the pre-trained BERT language model, solve the problem. We evaluate the approach on the AnATAr dataset and compare its performance with several baselines, including traditional machine learning models and models based on handcrafted features. The proposed model outperforms the baselines, which include KNN, logistic regression, and SVM, across all metrics. We also examine several modifications to the model, including concatenating the anaphor text with the paragraph text as input, adding a mask to focus on candidate scores, and filtering candidates by gender and number agreement with the anaphor. These modifications significantly improve the model's performance, reaching up to 81% MRR and 71% F1 while also yielding higher precision, recall, and accuracy. These findings suggest that the proposed model is an effective approach to Arabic pronoun resolution and highlight the potential benefits of advanced neural NLP models.
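The candidate mask, the agreement filter, and the MRR metric mentioned above can be illustrated with a minimal sketch. The abstract does not specify how these steps are implemented, so everything here is an assumption made for illustration: the function names, the per-token score list, the `(gender, number)` feature tuples, and the toy values are hypothetical and are not taken from the paper or the AnATAr dataset.

```python
import math

# Hypothetical helpers: one plausible reading of "masking" and "agreement
# filtering", not the authors' implementation.

def apply_mask(token_scores, keep):
    """Keep scores where `keep` is True; set all other positions to -inf."""
    return [s if k else -math.inf for s, k in zip(token_scores, keep)]

def agreement_mask(features, anaphor_features):
    """True where a token's (gender, number) pair matches the anaphor's."""
    return [f == anaphor_features for f in features]

def rank_positions(scores):
    """Token positions sorted by descending score, dropping masked ones."""
    live = [i for i, s in enumerate(scores) if s != -math.inf]
    return sorted(live, key=lambda i: scores[i], reverse=True)

def mean_reciprocal_rank(rankings, gold_positions):
    """Average of 1/rank of the gold antecedent; 0 if it was filtered out."""
    total = 0.0
    for ranking, gold in zip(rankings, gold_positions):
        if gold in ranking:
            total += 1.0 / (ranking.index(gold) + 1)
    return total / len(gold_positions)

# Toy input (assumed values): raw scores for four tokens of one paragraph.
token_scores = [0.9, 0.2, 0.7, 0.4]
is_candidate = [False, True, True, True]   # token 0 is not a candidate
features = [("m", "sg"), ("f", "sg"), ("f", "sg"), ("m", "sg")]
anaphor = ("m", "sg")                      # masculine singular pronoun
gold = 3                                   # position of the true antecedent

# Step 1: the candidate mask removes non-candidate tokens from the ranking.
candidate_scores = apply_mask(token_scores, is_candidate)
ranking_masked = rank_positions(candidate_scores)

# Step 2: the agreement filter also removes candidates that disagree
# with the anaphor in gender or number.
filtered_scores = apply_mask(candidate_scores, agreement_mask(features, anaphor))
ranking_filtered = rank_positions(filtered_scores)

mrr_masked = mean_reciprocal_rank([ranking_masked], [gold])
mrr_filtered = mean_reciprocal_rank([ranking_filtered], [gold])
```

In this toy case the gold antecedent is ranked second after the candidate mask alone and first once disagreeing candidates are filtered out, so the reciprocal rank rises from 0.5 to 1.0. This only illustrates the mechanism by which such filtering can help; it says nothing about the size of the gains reported in the paper.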