ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics

Hend Al-Khalifa, Nadia Ghezaiel, Maria Bounnit, Hend Hamed Alhazmi, Noof Abdullah Alfear, Reem Fahad Alqifari, Ameera Masoud Almasoud, Sharefah Ahmed Al-Ghamdi

From arXiv. Paper accepted at the Fifteenth biennial Language Resources and Evaluation Conference (LREC 2026).

The growing importance of culturally-aware natural language processing systems has led to an increasing demand for resources that capture sociopragmatic phenomena across diverse languages. Nevertheless, Arabic-language resources for politeness detection remain under-explored, despite the rich and complex politeness expressions embedded in Arabic communication. In this paper, we introduce ADAB (Arabic Politeness Dataset), a new annotated Arabic dataset collected from four online platforms, including social media, e-commerce, and customer service domains, covering Modern Standard Arabic and multiple dialects (Gulf, Egyptian, Levantine, and Maghrebi). The dataset was annotated based on Arabic linguistic traditions and pragmatic theory, resulting in three classes: polite, impolite, and neutral. It contains 10,000 samples with linguistic feature annotations across 16 politeness categories and achieves substantial inter-annotator agreement (kappa = 0.703). We benchmark 40 model configurations, including traditional machine learning, transformer-based models, and large language models. The dataset aims to support research on politeness-aware Arabic NLP.
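
To make the reported agreement figure concrete, the following is a minimal sketch of how a kappa score is computed over the three classes. The abstract does not say which kappa variant or how many annotators were used, so this assumes Cohen's kappa for two annotators; the function name `cohen_kappa` and the toy label lists are hypothetical and are not drawn from ADAB.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: two annotators and nominal labels. The paper may have
    used a different variant (for example Fleiss' kappa).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels for illustration only (not ADAB data).
annotator_1 = ["polite", "polite", "impolite", "neutral", "neutral",
               "polite", "impolite", "neutral", "polite", "neutral"]
annotator_2 = ["polite", "neutral", "impolite", "neutral", "neutral",
               "polite", "impolite", "polite", "polite", "neutral"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.4f}")
```

Under the commonly used Landis and Koch interpretation, values between 0.61 and 0.80 are described as substantial agreement, which matches the wording used for the reported 0.703.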

