System prompts that include detailed instructions describing the task performed by the underlying large language model (LLM) can easily transform foundation models into tools and services with minimal overhead. Because of their crucial impact on utility, they are often considered intellectual property, similar to the code of a software product. However, system prompts can easily be extracted using prompt injection. As of today, there is no effective countermeasure against the theft of system prompts, and all safeguarding efforts can be evaded with carefully crafted prompt injections. In this work, we propose an alternative to conventional system prompts. We introduce prompt obfuscation to prevent the extraction of the system prompt while maintaining the utility of the system itself with little overhead. The core idea is to find a representation of the original system prompt that leads to the same functionality, while the obfuscated system prompt does not contain any information that allows conclusions to be drawn about the original system prompt. We implement an optimization-based method to find an obfuscated prompt representation while maintaining the functionality. To evaluate our approach, we investigate eight different metrics to compare the performance of a system using the original and the obfuscated system prompts, and we show that the obfuscated version is consistently on par with the original one. We further perform three different deobfuscation attacks and show that, even with access to the obfuscated prompt and the LLM itself, we are not able to consistently extract meaningful information. Overall, we show that prompt obfuscation can be an effective method to protect intellectual property while maintaining the same utility as the original system prompt.
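To make the core idea concrete, the following is a minimal toy sketch of optimization-based obfuscation. It is not the method from this work. It assumes a hypothetical linear stand-in for the LLM (`toy_model`), a prompt represented as a single continuous vector, and plain gradient descent on the cross-entropy between the outputs under the original and the obfuscated prompt. All names, dimensions, and hyperparameters are illustrative assumptions.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(W, prompt, user_input):
    """Hypothetical stand-in for an LLM (assumption, not a real model):
    a distribution over a tiny vocabulary computed from the sum of the
    prompt vector and the input vector."""
    combined = [p + u for p, u in zip(prompt, user_input)]
    return softmax(matvec(W, combined))


def obfuscate(W, original, inputs, steps=5000, lr=1.0, seed=1):
    """Search for a different prompt vector that reproduces the outputs
    of `original` on the sample inputs."""
    rng = random.Random(seed)
    dim = len(original)
    # Start from a random vector unrelated to the original prompt.
    soft = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Reference outputs produced with the original prompt.
    targets = [toy_model(W, original, u) for u in inputs]
    for _ in range(steps):
        grad = [0.0] * dim
        for u, target in zip(inputs, targets):
            out = toy_model(W, soft, u)
            # Gradient of cross-entropy(target, out) w.r.t. the logits.
            diff = [o - t for o, t in zip(out, target)]
            # Chain rule through the linear layer: W^T (out - target).
            for j in range(dim):
                grad[j] += sum(d * row[j] for d, row in zip(diff, W))
        soft = [s - lr * g / len(inputs) for s, g in zip(soft, grad)]
    return soft


def demo(seed=0):
    """Return (largest output gap, distance between the two prompts)."""
    rng = random.Random(seed)
    vocab, dim = 3, 8  # illustrative sizes only
    W = [[rng.uniform(-0.3, 0.3) for _ in range(dim)] for _ in range(vocab)]
    original = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    inputs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(5)]
    soft = obfuscate(W, original, inputs)
    max_gap = max(
        abs(a - b)
        for u in inputs
        for a, b in zip(toy_model(W, original, u), toy_model(W, soft, u))
    )
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(original, soft)))
    return max_gap, distance


if __name__ == "__main__":
    gap, dist = demo()
    print(f"max output gap: {gap:.6f}, prompt distance: {dist:.3f}")
```

In this toy setting the optimized vector yields nearly the same output distributions as the original while remaining far from it in parameter space, because many prompt vectors map to the same outputs. Whether such a vector leaks information about the original prompt is a separate question that this sketch does not address.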