As AI assistants become widely used, privacy-aware platforms such as Anthropic's Clio have been introduced to generate insights from real-world AI use. Clio's privacy protections rely on layering multiple heuristic techniques, including PII redaction, clustering, filtering, and LLM-based privacy auditing. In this paper, we put these claims to the test by presenting CLIOPATRA, the first privacy attack against "privacy-preserving" LLM insight systems. The attack assumes a realistic adversary who carefully designs and inserts malicious chats into the system to break through multiple layers of privacy protection and induce the leakage of sensitive information from a target user's chat. We evaluate CLIOPATRA on synthetically generated medical target chats and demonstrate that an adversary who knows only a target user's basic demographics and a single symptom can extract the user's medical history in 39% of cases simply by inspecting Clio's output. Furthermore, CLIOPATRA's success rate approaches 100% when Clio is configured with other state-of-the-art models and the adversary's knowledge of the target user is increased. We also show that existing ad hoc mitigations, such as LLM-based privacy auditing, are unreliable and fail to detect major leaks. Our findings indicate that current heuristic protections, even when layered, are insufficient to adequately protect user data in LLM-based analysis systems.