Privacy policies disclose how an organization collects and handles personal information. Recent work has made progress in leveraging natural language processing (NLP) to automate privacy policy analysis and extract data collection statements from individual sentences, each considered in isolation. In this paper, we view and analyze, for the first time, the entire text of a privacy policy in an integrated way. In terms of methodology: (1) we define PoliGraph, a type of knowledge graph that captures statements in a privacy policy as relations between different parts of the text; and (2) we develop an NLP-based tool, PoliGraph-er, to automatically extract PoliGraph from the text. In addition, (3) we revisit the notion of ontologies, previously defined in heuristic ways, to capture subsumption relations between terms. We make a clear distinction between local and global ontologies to capture the context of individual privacy policies, application domains, and privacy laws. Using a public dataset for evaluation, we show that PoliGraph-er identifies 40% more collection statements than the prior state of the art, with 97% precision. In terms of applications, PoliGraph enables automated analysis of a corpus of privacy policies and allows us to: (1) reveal common patterns in the texts across different privacy policies, and (2) assess the correctness of the terms as defined within a privacy policy. We also apply PoliGraph to: (3) detect contradictions in a privacy policy, where we show that prior work raises false alarms, and (4) analyze the consistency of privacy policies and network traffic, where we identify significantly more clear disclosures than prior work.
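To make the graph view concrete, the following is a minimal illustrative sketch, not the PoliGraph schema or the PoliGraph-er implementation. The abstract does not specify edge types, so the class `PolicyGraph`, the two edge kinds ("collect" and "subsume"), and all example terms are assumptions introduced here. The sketch shows how subsumption relations between terms could let a collection statement about a general term ("personal information") answer a query about a more specific one ("email address").

```python
from collections import defaultdict


class PolicyGraph:
    """Toy policy graph with two hypothetical edge types: collect and subsume."""

    def __init__(self):
        # entity -> data types the policy text says it collects (assumed edge type)
        self.collects = defaultdict(set)
        # general term -> more specific terms it subsumes (assumed edge type)
        self.subsumes = defaultdict(set)

    def add_collect(self, entity, data_type):
        self.collects[entity].add(data_type)

    def add_subsume(self, general, specific):
        self.subsumes[general].add(specific)

    def descendants(self, term):
        """Return every term reachable from `term` by following subsume edges."""
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            for child in self.subsumes.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

    def collectors_of(self, data_type):
        """Entities that collect `data_type` directly or via a subsuming term."""
        result = set()
        for entity, stated_terms in self.collects.items():
            for term in stated_terms:
                if term == data_type or data_type in self.descendants(term):
                    result.add(entity)
        return result


# Hypothetical statements, as if extracted from different sentences of one policy.
graph = PolicyGraph()
graph.add_collect("we", "personal information")
graph.add_subsume("personal information", "contact information")
graph.add_subsume("contact information", "email address")
graph.add_collect("advertising partner", "device identifier")

print(graph.collectors_of("email address"))
print(graph.collectors_of("device identifier"))
```

In this toy example, no single sentence states that "we" collect an email address. The answer only emerges by linking the collection statement with the two subsumption statements, which is the kind of cross-sentence reasoning that a sentence-by-sentence analysis would miss.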