PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage

Recent studies have discovered that LLMs have serious privacy leakage concerns, where an LLM may be fooled into outputting private information under carefully crafted adversarial prompts. These risks include leaking system prompts, personally identifiable information, training data, and model parameters. Most existing red-teaming approaches for privacy leakage rely on humans to craft the adversarial prompts. A few automated methods are proposed for system prompt extraction, but they cannot be applied to more severe risks (e.g., training data extraction) and have limited effectiveness even for system prompt extraction. In this paper, we propose PrivAgent, a novel black-box red-teaming framework for LLM privacy leakage. We formulate different risks as a search problem with a unified attack goal. Our framework trains an open-source LLM through reinforcement learning as the attack agent to generate adversarial prompts for different target models under different risks. We propose a novel reward function to provide effective and fine-grained rewards for the attack agent. Finally, we introduce customizations to better fit our general framework to system prompt extraction and training data extraction. Through extensive evaluations, we first show that PrivAgent outperforms existing automated methods in system prompt leakage against six popular LLMs. Notably, our approach achieves a 100% success rate in extracting system prompts from real-world applications in OpenAI's GPT Store. We also show PrivAgent's effectiveness in extracting training data from an open-source LLM with a success rate of 5.9%. We further demonstrate PrivAgent's effectiveness in evading the existing guardrail defense and its helpfulness in enabling better safety alignment. Finally, we validate our customized designs through a detailed ablation study. We release our code here https://github.com/rucnyz/RedAgent.

翻译：近期研究发现，大语言模型存在严重的隐私泄露隐患，在精心构造的对抗性提示下可能被诱导输出隐私信息。这些风险包括系统提示泄露、个人身份信息泄露、训练数据泄露及模型参数泄露。现有针对隐私泄露的对抗测试方法大多依赖人工构造对抗性提示。少数针对系统提示提取的自动化方法已被提出，但无法应用于更严重的风险（如训练数据提取），且即使在系统提示提取任务中效果也有限。本文提出PrivAgent，一种针对大语言模型隐私泄露的新型黑盒对抗测试框架。我们将不同风险形式化为具有统一攻击目标的搜索问题。该框架通过强化学习训练开源大语言模型作为攻击智能体，针对不同目标模型和风险类型生成对抗性提示。我们提出了一种新颖的奖励函数，为攻击智能体提供有效且细粒度的奖励。最后，我们引入定制化设计使通用框架更好地适配系统提示提取和训练数据提取任务。通过大量实验评估，我们首先证明在针对六种主流大语言模型的系统提示泄露测试中，PrivAgent优于现有自动化方法。值得注意的是，本方法在从OpenAI GPT Store的实际应用中提取系统提示时达到100%的成功率。实验还表明PrivAgent在从开源大语言模型提取训练数据时达到5.9%的成功率。我们进一步证明了PrivAgent在规避现有防护机制方面的有效性及其对实现更优安全对齐的促进作用。最后，通过详尽的消融实验验证了定制化设计的有效性。代码已发布于https://github.com/rucnyz/RedAgent。