Large Language Models (LLMs) are increasingly deployed with third-party system prompts downloaded from public marketplaces. We identify a critical supply-chain vulnerability: conditional system prompt poisoning, in which an adversary embeds a ``sleeper agent'' in a benign-looking prompt. Unlike traditional jailbreaks, which aim for broad refusal-breaking, our proposed framework, PARASITE, optimizes a system prompt so that the LLM outputs targeted, compromised responses only for specific queries (e.g., ``Who should I vote for the US President?'') while maintaining high utility on benign inputs. Operating in a strict black-box setting with no access to model weights, PARASITE uses a two-stage optimization: a global semantic search followed by a greedy lexical refinement. Tested on open-source models and commercial APIs (GPT-4o-mini, GPT-3.5), PARASITE achieves up to a 70\% F1 reduction on targeted queries with minimal degradation of general capabilities. We further demonstrate that these poisoned prompts evade standard defenses, including perplexity filters and typo correction, by exploiting the natural noise found in real-world system prompts. Our code and data are available at https://github.com/vietph34/PARASITE. WARNING: This paper contains examples that some readers may find sensitive.