This paper introduces the Saudi Privacy Policy Dataset, a diverse compilation of Arabic privacy policies from various sectors in Saudi Arabia, annotated according to the 10 principles of the Personal Data Protection Law (PDPL). The PDPL was established to be compatible with the General Data Protection Regulation (GDPR), one of the most comprehensive data regulations worldwide. Data were collected from multiple sources, including the Saudi Central Bank, the Saudi Arabia National United Platform, the Council of Health Insurance, and general websites found using Google and Wikipedia. The final dataset covers 1,000 websites belonging to 7 sectors and comprises 4,638 lines of text, 775,370 tokens, and a corpus size of 8,353 KB. The annotated dataset offers significant reuse potential for assessing privacy policy compliance, benchmarking privacy practices across industries, and developing automated tools for monitoring adherence to data protection regulations. By providing this annotated dataset, the paper aims to facilitate further research in privacy policy analysis, natural language processing, and machine learning applications related to privacy and data protection, and to serve as a resource for researchers, policymakers, and industry professionals interested in understanding and promoting compliance with privacy regulations in Saudi Arabia.