Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs

Abdellah El Mekki, Samar M. Magdy, Houdaifa Atou, Ruwa AbuHweidi, Baraah Qawasmeh, Omer Nacar, Thikra Al-hibiri, Razan Saadie, Hamzah Alsayadi, Nadia Ghezaiel Hammouda, Alshima Alkhazimi, Aya Hamod, Al-Yas Al-Ghafri, Wesam El-Sayed, Asila Al sharji, Mohamad Ballout, Anas Belfathi, Karim Ghaddar, Serry Sibaee, Alaa Aoun, Areej Asiri, Lina Abureesh, Ahlam Bashiti, Majdal Yousef, Abdulaziz Hafiz, Yehdih Mohamed, Emira Hamedtou, Brakehe Brahim, Rahaf Alhamouri, Youssef Nafea, Aya El Aatar, Walid Al-Dhabyani, Emhemed Hamed, Sara Shatnawi, Fakhraddin Alwajih, Khalid Elkhidir, Ashwag Alasmari, Abdurrahman Gerrio, Omar Alshahri, AbdelRahim A. Elmadany, Ismail Berrada, Amir Azad Adli Alkathiri, Fadi A Zaraket, Mustafa Jarrar, Yahya Mohamed El Hadj, Hassan Alhuzali, Muhammad Abdul-Mageed

Source: arXiv. Project resources will be available at https://github.com/UBC-NLP/Alexandria

Arabic is a highly diglossic language in which most daily communication occurs in regional dialects rather than Modern Standard Arabic. Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce Alexandria, a large-scale, community-driven, human-translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high-impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city-of-origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of multi-turn conversational scenarios annotated with speaker-addressee gender configurations, enabling the study of gender-conditioned variation in dialectal use. Comprising 107K samples in total, Alexandria serves as both a training resource and a rigorous benchmark for evaluating MT systems and Large Language Models (LLMs). Our automatic and human evaluation of Arabic-aware LLMs benchmarks current capabilities in translating across diverse Arabic dialects and sub-dialects, while exposing significant and persistent challenges.
