Large language models (LLMs) are increasingly utilized in applications where system prompts, which guide model outputs, play a crucial role. These prompts often contain business logic and sensitive information, making their protection essential. However, adversarial and even regular user queries can exploit LLM vulnerabilities to expose these hidden prompts. To address this issue, we present PromptKeeper, a novel defense mechanism for system prompt privacy. By reliably detecting worst-case leakage and regenerating outputs without the system prompt when necessary, PromptKeeper ensures robust protection against prompt extraction attacks via either adversarial or regular queries, while preserving conversational capability and runtime efficiency during benign user interactions.
翻译:大型语言模型(LLMs)在各类应用中的使用日益广泛,其中指导模型输出的系统提示起着至关重要的作用。这些提示通常包含业务逻辑和敏感信息,因此对其加以保护至关重要。然而,恶意甚至常规的用户查询都可能利用LLM的漏洞来暴露这些隐藏的提示。为解决这一问题,我们提出了PromptKeeper,一种用于保护系统提示隐私的新型防御机制。通过可靠地检测最坏情况下的信息泄露,并在必要时重新生成不含系统提示的输出,PromptKeeper能够有效抵御通过恶意或常规查询进行的提示提取攻击,同时在良性用户交互过程中保持对话能力和运行效率。