AraPoemBERT: A Pretrained Language Model for Arabic Poetry Analysis

Arabic poetry, with its rich linguistic features and profound cultural significance, presents a unique challenge to the field of Natural Language Processing (NLP). The complexity of its structure and context calls for advanced computational models for accurate analysis. In this paper, we introduce AraPoemBERT, an Arabic language model pretrained exclusively on Arabic poetry text. To demonstrate its effectiveness, we compared AraPoemBERT with five other Arabic language models on various NLP tasks related to Arabic poetry. The new model outperformed all other models and achieved state-of-the-art results in most of the downstream tasks. AraPoemBERT achieved unprecedented accuracy in two of the three novel tasks: poet's gender classification (99.34\% accuracy) and poetry sub-meter classification (97.79\% accuracy). In addition, in poem rhyme classification the model reached 97.73\% accuracy, which is almost equivalent to the best score reported in this study. Moreover, the proposed model significantly outperformed previous work and the other models compared on poem sentiment analysis (78.95\% accuracy) and poetry meter classification (99.03\% accuracy), while significantly expanding the scope of these two problems. The dataset used in this study contains more than 2.09 million verses collected from online sources, each associated with attributes such as meter, sub-meter, poet, rhyme, and topic. The results demonstrate the effectiveness of the proposed model in understanding and analyzing Arabic poetry. The AraPoemBERT model is publicly available at \url{https://huggingface.co/faisalq}.
