The growing importance of culturally aware natural language processing systems has increased demand for resources that capture sociopragmatic phenomena across diverse languages. Nevertheless, Arabic-language resources for politeness detection remain under-explored, despite the rich and complex politeness expressions embedded in Arabic communication. In this paper, we introduce ADAB (Arabic Politeness Dataset), a new annotated Arabic dataset collected from four online platforms spanning the social media, e-commerce, and customer service domains and covering Modern Standard Arabic and multiple dialects (Gulf, Egyptian, Levantine, and Maghrebi). The dataset was annotated according to Arabic linguistic traditions and pragmatic theory, resulting in three classes: polite, impolite, and neutral. It contains 10,000 samples with linguistic feature annotations across 16 politeness categories and achieves substantial inter-annotator agreement (kappa = 0.703). We benchmark 40 model configurations, including traditional machine learning models, transformer-based models, and large language models. The dataset aims to support research on politeness-aware Arabic NLP.