Investigating the prompt leakage effect and black-box defenses for multi-turn LLM interactions

Prompt leakage in large language models (LLMs) poses a significant security and privacy threat, particularly in retrieval-augmented generation (RAG) systems. However, leakage in multi-turn LLM interactions along with mitigation strategies has not been studied in a standardized manner. This paper investigates LLM vulnerabilities against prompt leakage across 4 diverse domains and 10 closed- and open-source LLMs. Our unique multi-turn threat model leverages the LLM's sycophancy effect and our analysis dissects task instruction and knowledge leakage in the LLM response. In a multi-turn setting, our threat model elevates the average attack success rate (ASR) to 86.2%, including a 99% leakage with GPT-4 and claude-1.3. We find that some black-box LLMs like Gemini show variable susceptibility to leakage across domains - they are more likely to leak contextual knowledge in the news domain compared to the medical domain. Our experiments measure specific effects of 6 black-box defense strategies, including a query-rewriter in the RAG scenario. Our proposed multi-tier combination of defenses still has an ASR of 5.3% for black-box LLMs, indicating room for enhancement and future direction for LLM security research.

翻译：大型语言模型（LLM）中的提示泄露构成了重大的安全与隐私威胁，尤其在检索增强生成（RAG）系统中。然而，针对多轮LLM交互中的泄露问题及其缓解策略，目前尚未有标准化的研究。本文系统地探究了LLM在4个不同领域和10个闭源与开源模型中的提示泄露脆弱性。我们提出的独特多轮威胁模型利用了LLM的谄媚效应，并深入分析了LLM响应中的任务指令泄露与知识泄露。在多轮设定下，该威胁模型将平均攻击成功率（ASR）提升至86.2%，其中包括针对GPT-4和claude-1.3达到99%的泄露率。我们发现部分黑盒LLM（如Gemini）在不同领域呈现出差异化的泄露敏感性——在新闻领域较医疗领域更易泄露上下文知识。实验量化了6种黑盒防御策略的具体效果，包括RAG场景中的查询重写器。即使采用我们提出的多层组合防御方案，黑盒LLM的ASR仍达5.3%，这表明LLM安全研究仍有改进空间和未来发展方向。

相关内容

大语言模型

关注 66

大语言模型是基于海量文本数据训练的深度学习模型。它不仅能够生成自然语言文本，还能够深入理解文本含义，处理各种自然语言任务，如文本摘要、问答、翻译等。2023年，大语言模型及其在人工智能领域的应用已成为全球科技研究的热点，其在规模上的增长尤为引人注目，参数量已从最初的十几亿跃升到如今的一万亿。参数量的提升使得模型能够更加精细地捕捉人类语言微妙之处，更加深入地理解人类语言的复杂性。在过去的一年里，大语言模型在吸纳新知识、分解复杂任务以及图文对齐等多方面都有显著提升。随着技术的不断成熟，它将不断拓展其应用范围，为人类提供更加智能化和个性化的服务，进一步改善人们的生活和生产方式。

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

生成性对抗网络:理论模型、评估指标和最近发展的概述，Generative Adversarial Networks (GANs): An Overview of Theoretical Model, Evaluation Metrics, and Recent Developments

专知会员服务

42+阅读 · 2020年5月30日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【WSDM2020】超越统计关系：将知识关系整合到多标签音乐风格分类的风格关联中（附pdf）

专知会员服务

18+阅读 · 2019年11月23日