ARCADE: A City-Scale Corpus for Fine-Grained Arabic Dialect Tagging

Omer Nacar, Serry Sibaee, Adel Ammar, Yasser Alhabashi, Nadia Samer Sibai, Yara Farouk Ahmed, Ahmed Saud Alqusaiyer, Sulieman Mahmoud AlMahmoud, Abdulrhman Mamdoh Mukhaniq, Lubaba Raed, Sulaiman Mohammed Alatwah, Waad Nasser Alqahtani, Yousif Abdulmajeed Alnasser, Mohamed Aziz Khadraoui, Wadii Boulila

Arabic comprises many regional dialects that differ substantially in phonetics and lexicon, reflecting the geographic and cultural diversity of its speakers. Although many multi-dialect datasets exist, mapping speech to fine-grained dialect sources, such as individual cities, remains underexplored. We present ARCADE (Arabic Radio Corpus for Audio Dialect Evaluation), the first Arabic speech dataset designed explicitly with city-level dialect granularity. The corpus consists of Arabic radio speech collected from streaming services across the Arab world. Our data pipeline captures 30-second segments from verified radio streams, covering both Modern Standard Arabic (MSA) and diverse dialectal speech. To ensure reliability, each clip was annotated by one to three native Arabic-speaking reviewers, who assigned rich metadata including emotion, speech type, dialect category, and a validity flag for dialect identification tasks. The resulting corpus contains 6,907 annotations over 3,790 unique audio segments spanning 58 cities across 19 countries. These fine-grained annotations support multi-task learning and serve as a benchmark for city-level dialect tagging. We detail the data collection methodology, assess audio quality, and analyze the label distributions. The dataset is available at: https://huggingface.co/datasets/riotu-lab/ARCADE-full
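
The 30-second segmentation step mentioned above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `split_into_segments`, the `keep_partial` option, and the 16 kHz sample rate in the usage lines are assumptions made only for illustration, and the audio is represented as a plain list of samples rather than a live radio stream.

```python
from typing import List, Sequence

SEGMENT_SECONDS = 30  # clip length stated in the abstract


def split_into_segments(
    samples: Sequence[float],
    sample_rate: int,
    segment_seconds: int = SEGMENT_SECONDS,
    keep_partial: bool = False,
) -> List[Sequence[float]]:
    """Cut a mono waveform into consecutive fixed-length segments.

    Hypothetical helper: a trailing chunk shorter than one full segment
    is dropped unless keep_partial is True.
    """
    if sample_rate <= 0 or segment_seconds <= 0:
        raise ValueError("sample_rate and segment_seconds must be positive")

    segment_length = sample_rate * segment_seconds  # samples per segment
    segments: List[Sequence[float]] = []
    for start in range(0, len(samples), segment_length):
        chunk = samples[start:start + segment_length]
        if len(chunk) == segment_length or (keep_partial and len(chunk) > 0):
            segments.append(chunk)
    return segments


# Usage: 75 seconds of silence at an assumed 16 kHz sample rate.
audio = [0.0] * (16_000 * 75)
clips = split_into_segments(audio, 16_000)
print(len(clips))  # 2 full 30-second clips; the 15-second remainder is dropped
```

Dropping the short remainder keeps every clip at the same duration, which is one plausible reading of "30-second segments"; whether the actual pipeline discards or pads partial clips is not stated in the abstract.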
