System prompts are critical for guiding the behavior of Large Language Models (LLMs), yet they often contain proprietary logic or sensitive information, making them a prime target for extraction attacks. Adversarial queries can successfully elicit these hidden instructions, posing significant security and privacy risks. Existing defense mechanisms frequently rely on heuristics, incur substantial computational overhead, or are inapplicable to models accessed via black-box APIs. This paper introduces a novel framework for hardening system prompts through shield appending, a lightweight approach that adds a protective textual layer to the original prompt. Our core contribution is the formalization of prompt hardening as a utility-constrained optimization problem. We leverage an LLM-as-optimizer to search the space of possible SHIELDs, seeking to minimize a leakage metric derived from a suite of adversarial attacks while preserving task utility above a specified threshold, measured by semantic fidelity to baseline outputs. This black-box, optimization-driven methodology is lightweight and practical, requiring only API access to the target and optimizer LLMs. We demonstrate empirically that our optimized SHIELDs significantly reduce prompt leakage against a comprehensive set of extraction attacks, outperforming established baseline defenses without compromising the model's intended functionality. Our work presents a paradigm for developing robust, utility-aware defenses in the escalating landscape of LLM security. The code is publicly available at https://github.com/psm-defense/psm
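The utility-constrained search described above can be sketched as a simple rejection loop: an optimizer proposes candidate shields, candidates that fall below the utility threshold are discarded, and the lowest-leakage feasible candidate is retained. This is a minimal illustrative sketch, not the paper's implementation; all function names (`query_llm`, `is_leaked`, `similarity`, `propose_shield`) and parameters (`tau`, `iters`) are assumed placeholders supplied by the caller.

```python
def harden(system_prompt, attacks, benign_queries,
           query_llm, is_leaked, similarity, propose_shield,
           tau=0.9, iters=20):
    """Search for a SHIELD suffix that minimizes leakage subject to a
    utility floor. All callables are caller-supplied placeholders."""

    def leakage(shield):
        # Fraction of attack queries whose responses reveal the prompt.
        hits = sum(is_leaked(query_llm(system_prompt + shield, a), system_prompt)
                   for a in attacks)
        return hits / len(attacks)

    def utility(shield):
        # Mean semantic fidelity of shielded outputs to unshielded baselines.
        return sum(similarity(query_llm(system_prompt + shield, q),
                              query_llm(system_prompt, q))
                   for q in benign_queries) / len(benign_queries)

    best_shield, best_leak = "", leakage("")
    for _ in range(iters):
        cand = propose_shield(best_shield, best_leak)  # optimizer-LLM step
        if utility(cand) < tau:      # utility constraint: reject harmful shields
            continue
        leak = leakage(cand)
        if leak < best_leak:         # keep the lowest-leakage feasible shield
            best_shield, best_leak = cand, leak
    return best_shield, best_leak
```

Because every model interaction goes through the `query_llm` callable, the loop needs only black-box API access, matching the setting the abstract targets.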