As large language models (LLMs) become increasingly central to Arabic NLP applications, evaluating their understanding of regional dialects and cultural nuances is essential, particularly in linguistically diverse settings such as Saudi Arabia. This paper introduces \texttt{Absher}, a comprehensive benchmark specifically designed to assess LLM performance across major Saudi dialects. \texttt{Absher} comprises over 18,000 multiple-choice questions spanning six distinct categories: Meaning, True/False, Fill-in-the-Blank, Contextual Usage, Cultural Interpretation, and Location Recognition. These questions are derived from a curated dataset of dialectal words, phrases, and proverbs sourced from various regions of Saudi Arabia. We evaluate several state-of-the-art LLMs, including both multilingual and Arabic-specific models, and provide detailed insights into their capabilities and limitations. Our results reveal notable performance gaps, particularly in tasks requiring cultural inference or contextual understanding. These findings highlight the urgent need for dialect-aware training and culturally aligned evaluation methodologies to improve LLM performance in real-world Arabic applications.