Privacy captions are short sentences that succinctly describe what personal information an app uses, how it is used, and why. These captions can be used in various notice formats, such as privacy policies, app rationales, and app store descriptions. However, inaccurate captions may mislead users and expose developers to regulatory fines. Existing approaches to generating privacy notices, or privacy captions specifically, rely on questionnaires, templates, static analysis, or machine learning. However, these approaches either rely heavily on developer input, which burdens developers; use limited source code context, which leads to incomplete capture of app privacy behaviors; or depend on potentially inaccurate privacy policies as the source for creating notices. In this work, we address these limitations by developing Privacy Caption Generator (PCapGen), an approach that i) automatically identifies and extracts large and precise source code context that implements privacy behaviors in an app, ii) uses a Large Language Model (LLM) to describe coarse- and fine-grained privacy behaviors, and iii) generates accurate, concise, and complete privacy captions to describe the privacy behaviors of the app. Our evaluation shows that PCapGen generates more concise, complete, and accurate privacy captions than the baseline approach. Furthermore, privacy experts choose PCapGen captions at least 71% of the time, and LLMs-as-judge prefer PCapGen captions at least 76% of the time, indicating strong performance of our approach.