Privacy policies inform users about organizations' data management practices. Yet their complexity often makes them largely incomprehensible to the average user, which motivates the development of privacy assistants. Generative AI (genAI) technologies offer untapped potential to improve how privacy assistants answer user queries. However, the reliability of genAI remains a concern because of its propensity to generate incorrect or misleading information. This study introduces GenAIPABench, a novel benchmarking framework designed to evaluate the performance of Generative AI-based Privacy Assistants (GenAIPAs). GenAIPABench comprises: 1) a comprehensive set of questions about an organization's privacy policy and a data protection regulation, along with annotated answers for several organizations and regulations; 2) a robust set of evaluation metrics for assessing the accuracy, relevance, and consistency of the generated responses; and 3) an evaluation tool that generates appropriate prompts to introduce the system to the privacy document, as well as different variations of the privacy questions to evaluate its robustness. We use GenAIPABench to assess the potential of three leading genAI systems to become GenAIPAs: ChatGPT, Bard, and Bing AI. Our results demonstrate significant promise in genAI capabilities in the privacy domain while also highlighting challenges in managing complex queries, ensuring consistency, and verifying source accuracy.
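To make the three components in the abstract concrete, the following is a minimal sketch of how a question with paraphrased variants might be run against an assistant and scored. It is not GenAIPABench's actual implementation: the prompt template, the exact-match `accuracy` score, the majority-agreement `consistency` score, and the names `build_prompt`, `normalize`, `evaluate`, and `toy_assistant` are all assumptions made for illustration. The abstract does not specify how its metrics are computed.

```python
from collections import Counter
from typing import Callable, Dict, List


def build_prompt(policy_text: str, question: str) -> str:
    """Introduce the privacy document, then ask the question (hypothetical template)."""
    return (
        "Answer using only the privacy policy below.\n\n"
        f"POLICY:\n{policy_text}\n\n"
        f"QUESTION: {question}"
    )


def normalize(answer: str) -> str:
    """Crude normalization so trivially different strings compare equal."""
    return " ".join(answer.lower().split()).rstrip(".")


def evaluate(
    assistant: Callable[[str], str],
    policy_text: str,
    variants: List[str],
    reference: str,
) -> Dict[str, float]:
    """Score one question across its paraphrased variants.

    accuracy:    fraction of variant answers that match the annotated answer
                 (exact match after normalization; an illustrative stand-in).
    consistency: fraction of variant answers that agree with the most common
                 answer (also an illustrative stand-in).
    """
    if not variants:
        raise ValueError("at least one question variant is required")
    answers = [normalize(assistant(build_prompt(policy_text, q))) for q in variants]
    majority_count = Counter(answers).most_common(1)[0][1]
    ref = normalize(reference)
    return {
        "accuracy": sum(a == ref for a in answers) / len(answers),
        "consistency": majority_count / len(answers),
    }


def toy_assistant(prompt: str) -> str:
    """Stand-in for a real genAI system: keys on a single word in the question."""
    question = prompt.split("QUESTION: ", 1)[1]
    return "Yes." if "share" in question.lower() else "No"


if __name__ == "__main__":
    scores = evaluate(
        toy_assistant,
        policy_text="We may disclose personal data to advertising partners.",
        variants=[
            "Does the company share my data?",
            "Is my data shared with third parties?",
            "Do third parties get my information?",
        ],
        reference="Yes",
    )
    print(scores)
```

In this toy setup the stand-in assistant changes its answer when the question is rephrased without the word "share", which is the kind of sensitivity to question variations that a robustness evaluation is meant to surface. A real harness would replace exact-match scoring with human or rubric-based judgments of free-text answers.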