Arabic, with its rich diversity of dialects, remains significantly underrepresented in Large Language Models (LLMs), particularly in its dialectal variations. We address this gap by introducing seven synthetic datasets in Arabic dialects alongside Modern Standard Arabic (MSA), created using Machine Translation (MT) combined with human post-editing. We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation. We evaluate LLMs on dialect comprehension and generation, focusing specifically on low-resource Arabic dialects. Additionally, we introduce the first fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions, providing a novel dimension to LLM evaluation. Our findings demonstrate that while Arabic-specific models such as Jais and AceGPT outperform multilingual models on dialectal tasks, significant challenges persist in dialect identification, generation, and translation. This work contributes approximately 45K post-edited samples and a cultural benchmark, and it highlights the importance of tailored training for improving LLM performance in capturing the nuances of diverse Arabic dialects and cultural contexts. We will release the dialectal translation models and benchmarks curated in this study.