A Multi-solution Study on GDPR AI-enabled Completeness Checking of DPAs

Specifying legal requirements for software systems to ensure their compliance with the applicable regulations is a major concern in requirements engineering (RE). Personal data collected by an organization is often shared with other organizations to perform certain processing activities. In such cases, the General Data Protection Regulation (GDPR) requires issuing a data processing agreement (DPA), which regulates the processing and further ensures that personal data remains protected. Violating GDPR can lead to huge fines, reaching billions of euros. Software systems involving personal data processing must adhere to the legal obligations stipulated in GDPR and outlined in DPAs. Requirements engineers can elicit from DPAs the legal requirements for regulating data processing activities in software systems. Checking the completeness of a DPA against the GDPR provisions is therefore an essential prerequisite for ensuring that the elicited requirements are complete. Analyzing DPAs entirely manually is time-consuming and requires adequate legal expertise. In this paper, we propose an automation strategy for checking the completeness of DPAs against GDPR. Specifically, we pursue ten alternative solutions enabled by different technologies, namely traditional machine learning, deep learning, language modeling, and few-shot learning. The goal of our work is to empirically examine how these technologies fare in the legal domain. We compute the F2 score on a set of 30 real DPAs. Our evaluation shows that the best-performing solutions, which are based on pre-trained BERT and RoBERTa language models, yield F2 scores of 86.7% and 89.7%. Our analysis further shows that alternative solutions based on deep learning (e.g., BiLSTM) and few-shot learning (e.g., SetFit) can achieve comparable accuracy, yet are more efficient to develop.
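For readers unfamiliar with the metric, the F2 score is the standard F-beta measure with beta = 2, which weights recall more heavily than precision. The sketch below is illustrative only: the function name `f_beta` and the counts in the usage lines are assumptions made for this example, not values or code from the study, and the abstract does not state how true positives, false positives, and false negatives are counted.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Compute the F-beta score from raw counts.

    With beta > 1, false negatives (missed items) are penalized more
    than false positives (false alarms). beta = 2 gives the F2 score.
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Return 0.0 when there is nothing to score, avoiding division by zero.
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical counts (not from the paper): both cases have 8 true
# positives and 5 errors, but the errors are distributed differently.
mostly_false_alarms = f_beta(tp=8, fp=4, fn=1)
mostly_misses = f_beta(tp=8, fp=1, fn=4)
print(mostly_false_alarms > mostly_misses)
```

Under these assumed counts, the case dominated by misses scores lower than the case dominated by false alarms, which is the behavior one would want if overlooking a problem in a DPA is considered costlier than raising a spurious warning.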
