Agentic systems are increasingly acting on users' behalf, accessing calendars, email, and personal files to complete everyday tasks. Privacy evaluation for these systems has focused on the input and output boundaries, but each task involves several intermediate information flows, from agent queries to tool responses, that are not currently evaluated. We argue that every boundary in an agentic pipeline is a site of potential privacy violation and must be assessed independently. To support this, we introduce the Privacy Flow Graph, a Contextual Integrity-grounded framework that decomposes agentic execution into a sequence of information flows, each annotated with the five CI parameters, and traces violations to their point of origin. We present AgentSCOPE, a benchmark of 62 multi-tool scenarios across eight regulatory domains with ground truth at every pipeline stage. Our evaluation across seven state-of-the-art LLMs shows that privacy violations occur somewhere in the pipeline in over 80% of scenarios, and in 24% of scenarios the final output appears clean despite an upstream violation; most violations arise at the tool-response stage, where APIs return sensitive data indiscriminately. These results indicate that output-level evaluation alone substantially underestimates the privacy risk of agentic systems.
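To make the decomposition concrete, here is a minimal sketch of a flow-level representation in the spirit of the Privacy Flow Graph. The five parameters are the standard Contextual Integrity parameters (sender, recipient, information subject, information type, transmission principle); all class and field names are illustrative, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Flow:
    """One information flow in the pipeline, annotated with the five
    Contextual Integrity parameters plus a ground-truth violation label."""
    stage: str                    # e.g. "agent_query", "tool_response", "final_output"
    sender: str
    recipient: str
    subject: str                  # whose information is flowing
    info_type: str                # e.g. "availability", "medical_record"
    transmission_principle: str   # norm governing the flow, e.g. "task_necessity"
    violates_norm: bool = False   # ground-truth label for this flow

@dataclass
class PrivacyFlowGraph:
    flows: List[Flow] = field(default_factory=list)

    def first_violation(self) -> Optional[Flow]:
        """Trace a violation to its point of origin: the earliest
        flow in the pipeline that breaches its contextual norm."""
        return next((f for f in self.flows if f.violates_norm), None)

# A toy trace illustrating the paper's key failure mode: the tool
# response leaks sensitive data even though the final output is clean,
# so output-only evaluation would report no violation.
graph = PrivacyFlowGraph([
    Flow("agent_query", "agent", "calendar_api", "user",
         "availability", "task_necessity"),
    Flow("tool_response", "calendar_api", "agent", "user",
         "medical_record", "task_necessity", violates_norm=True),
    Flow("final_output", "agent", "colleague", "user",
         "availability", "task_necessity"),
])
origin = graph.first_violation()
print(origin.stage)  # -> tool_response
```

Scoring only the last flow would miss this scenario entirely, which is the gap the flow-level ground truth is meant to close.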