Privacy policies are the primary means by which online service providers inform users about their data collection and usage practices. However, in an effort to be comprehensive and to mitigate legal risk, these documents are often verbose. In practice, users tend to click the Agree button without reading them carefully, which exposes them to risks of privacy leakage and legal issues. Recently, the advent of Large Language Models (LLMs) such as ChatGPT and GPT-4 has opened new possibilities for text analysis, especially for lengthy documents like privacy policies. In this study, we investigate PolicyGPT, a privacy policy text analysis framework based on LLMs. The framework was tested on two datasets. The first comprises privacy policies from 115 websites, annotated by legal experts, with each segment assigned to one of 10 classes. The second consists of privacy policies from 304 popular mobile applications, with each sentence manually annotated and assigned to one of a separate set of 10 categories. Under zero-shot learning conditions, PolicyGPT demonstrated robust performance: it achieved an accuracy of 97% on the first dataset and 87% on the second, surpassing the baseline machine learning and neural network models.
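The zero-shot setup described in the abstract can be illustrated with a minimal sketch: the model receives only the list of categories and the text to classify, with no labelled examples, and its replies are scored against the expert annotations. This is not the PolicyGPT implementation. The category names, the prompt wording, the helper names (`build_prompt`, `parse_label`, `classify`, `accuracy`), and the `stub_model` stand-in for a real LLM call are all assumptions made for illustration.

```python
from typing import Callable, Sequence

# Hypothetical label set; the abstract does not list the actual categories.
CATEGORIES = ["Data Collection", "Data Sharing", "Data Security", "Other"]


def build_prompt(segment: str, categories: Sequence[str]) -> str:
    """Build a zero-shot prompt: category list plus segment, no labelled examples."""
    options = "\n".join(f"{i + 1}. {name}" for i, name in enumerate(categories))
    return (
        "Classify the privacy policy segment into exactly one category.\n"
        f"Categories:\n{options}\n"
        f"Segment: {segment}\n"
        "Answer with the category name only."
    )


def parse_label(reply: str, categories: Sequence[str], fallback: str = "Other") -> str:
    """Map a free-text model reply onto a known category, else the fallback."""
    lowered = reply.strip().lower()
    for name in categories:
        if name.lower() in lowered:
            return name
    return fallback


def classify(
    segment: str,
    model: Callable[[str], str],
    categories: Sequence[str] = CATEGORIES,
) -> str:
    """Classify one segment; `model` is any function from prompt to reply text."""
    return parse_label(model(build_prompt(segment, categories)), categories)


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the gold annotations."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def stub_model(prompt: str) -> str:
    """Keyword-based stand-in for an LLM call (assumption, for demonstration only)."""
    segment = prompt.split("Segment: ", 1)[1].lower()
    if "encrypt" in segment:
        return "Data Security"
    if "share" in segment:
        return "Data Sharing"
    return "Data Collection"
```

To evaluate a real model, `stub_model` would be replaced by a function that sends the prompt to the chosen LLM and returns its text reply; the rest of the sketch stays the same.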