LADFA: A Framework of Using Large Language Models and Retrieval-Augmented Generation for Personal Data Flow Analysis in Privacy Policies

Privacy policies help inform people about organisations' personal data processing practices, covering aspects such as data collection, data storage, and the sharing of personal data with third parties. However, they are often difficult for people to fully comprehend because of their lengthy and complex legal language and inconsistent practices across sectors and organisations. To support automated, large-scale analysis of privacy policies, many researchers have studied applications of machine learning and natural language processing techniques, including large language models (LLMs). While a small number of prior studies have used LLMs to extract personal data flows from privacy policies, our approach builds on this line of work by combining LLMs with retrieval-augmented generation (RAG) and a customised knowledge base derived from existing studies. This paper presents LADFA, an end-to-end computational framework that processes the unstructured text of a given privacy policy, extracts personal data flows, constructs a personal data flow graph, and analyses that graph to facilitate insight discovery. The framework consists of a pre-processor, an LLM-based processor, and a data flow post-processor. We demonstrated and validated the effectiveness and accuracy of the proposed approach through a case study of ten selected privacy policies from the automotive industry. LADFA is also designed to be flexible and customisable, making it suitable for a range of text-based analysis tasks beyond privacy policy analysis.
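
To make the three-stage structure concrete, the following is a minimal sketch of how a pre-processor, an extraction step, and a data flow post-processor might fit together. It is not the LADFA implementation: every name here (`DataFlow`, `pre_process`, `stub_extractor`, `extract_flows`, `build_graph`, `data_types_per_receiver`) is hypothetical, and a simple regular-expression matcher stands in for the LLM and RAG component, which in a real system would be swapped in through the `extractor` argument.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class DataFlow:
    """One personal data flow: sender -> data type -> receiver (hypothetical schema)."""
    sender: str
    data_type: str
    receiver: str


def pre_process(policy_text: str) -> List[str]:
    """Pre-processor: split unstructured policy text into sentence-like segments."""
    segments = re.split(r"(?<=[.!?])\s+", policy_text.strip())
    return [s for s in segments if s]


# Assumption: a toy pattern standing in for the LLM + RAG extraction step.
_SHARE = re.compile(
    r"^We (?:share|send) (?P<data>.+?) (?:with|to) (?P<receiver>.+?)\.?$"
)


def stub_extractor(segment: str) -> List[DataFlow]:
    """Stand-in extractor: returns the flows found in one segment, if any."""
    match = _SHARE.match(segment)
    if match is None:
        return []
    return [DataFlow("organisation", match["data"], match["receiver"])]


def extract_flows(
    segments: List[str], extractor: Callable[[str], List[DataFlow]]
) -> List[DataFlow]:
    """Processor: apply an extractor (here the stub) to every segment."""
    flows: List[DataFlow] = []
    for segment in segments:
        flows.extend(extractor(segment))
    return flows


def build_graph(flows: List[DataFlow]) -> Dict[str, Dict[str, Set[str]]]:
    """Post-processor: graph as sender -> receiver -> set of data types."""
    graph: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for flow in flows:
        graph[flow.sender][flow.receiver].add(flow.data_type)
    return graph


def data_types_per_receiver(graph: Dict[str, Dict[str, Set[str]]]) -> Dict[str, int]:
    """Simple analysis: number of distinct data types each receiver obtains."""
    received: Dict[str, Set[str]] = defaultdict(set)
    for receivers in graph.values():
        for receiver, data_types in receivers.items():
            received[receiver] |= data_types
    return {receiver: len(types) for receiver, types in received.items()}


if __name__ == "__main__":
    policy = (
        "We value your privacy. "
        "We share location data with insurance partners. "
        "We send vehicle diagnostics to dealerships."
    )
    flows = extract_flows(pre_process(policy), stub_extractor)
    print(data_types_per_receiver(build_graph(flows)))
```

Passing the extractor as a function keeps the pre- and post-processing independent of how flows are actually extracted, which is one plausible way to achieve the kind of flexibility the abstract describes.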
