Prompt leakage poses a significant security and privacy threat in LLM applications. Leakage of system prompts may compromise intellectual property and serve as adversarial reconnaissance for an attacker. A systematic evaluation of prompt leakage threats and mitigation strategies is lacking, especially for multi-turn LLM interactions. In this paper, we systematically investigate LLM vulnerabilities to prompt leakage across 10 closed- and open-source LLMs and four domains. We design a unique threat model that leverages the LLM sycophancy effect and elevates the average attack success rate (ASR) from 17.7% to 86.2% in a multi-turn setting. Our standardized setup further allows dissecting leakage of specific prompt contents, such as task instructions and knowledge documents. We measure the mitigation effect of 7 black-box defense strategies, along with finetuning an open-source model to defend against leakage attempts. We present different combinations of defenses against our threat model, including a cost analysis. Our study highlights key takeaways for building secure LLM applications and provides directions for research in multi-turn LLM interactions.