Arabic, with its rich diversity of dialects, remains significantly underrepresented in Large Language Models (LLMs), particularly in its dialectal variations. We address this gap by introducing seven synthetic datasets in Arabic dialects alongside Modern Standard Arabic (MSA), created using Machine Translation (MT) combined with human post-editing. We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation. We evaluate LLMs on dialect comprehension and generation, focusing specifically on low-resource Arabic dialects. Additionally, we introduce the first fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions, providing a novel dimension to LLM evaluation. Our findings demonstrate that while Arabic-specific models such as Jais and AceGPT outperform multilingual models on dialectal tasks, significant challenges persist in dialect identification, generation, and translation. This work contributes approximately 45K post-edited samples and a cultural benchmark, and it highlights the importance of tailored training for improving LLM performance in capturing the nuances of diverse Arabic dialects and cultural contexts. We will release the dialectal translation models and benchmarks curated in this study.