Privacy policies inform users about the data management practices of organizations. Yet their complexity often renders them largely incomprehensible to the average user, which has motivated the development of privacy assistants. With the advent of generative AI (genAI) technologies, there is untapped potential to improve how effectively privacy assistants answer user queries. However, the reliability of genAI remains a concern because of its propensity for generating incorrect or misleading information. This study introduces GenAIPABench, a novel benchmarking framework designed to evaluate the performance of Generative AI-based Privacy Assistants (GenAIPAs). GenAIPABench comprises: 1) a comprehensive set of questions about an organization's privacy policy and a data protection regulation, along with annotated answers for several organizations and regulations; 2) a robust set of evaluation metrics for assessing the accuracy, relevance, and consistency of the generated responses; and 3) an evaluation tool that generates appropriate prompts to introduce the system to the privacy document, as well as different variations of the privacy questions to evaluate its robustness. We use GenAIPABench to assess the potential of three leading genAI systems to serve as GenAIPAs: ChatGPT, Bard, and Bing AI. Our results show that genAI holds significant promise in the privacy domain, while also highlighting challenges in managing complex queries, ensuring consistency, and verifying source accuracy.