System Instructions in Large Language Models (LLMs) are commonly used to enforce safety policies, define agent behavior, and protect sensitive operational context in agentic AI applications. These instructions may contain sensitive information such as API credentials, internal policies, and privileged workflow definitions, making system instruction leakage a critical security risk highlighted in the OWASP Top 10 for LLM Applications. Without incurring the overhead costs of reasoning models, many LLM applications rely on refusal-based instructions that block direct requests for system instructions, implicitly assuming that prohibited information can only be extracted through explicit queries. We introduce an automated evaluation framework that tests whether system instructions remain confidential when extraction requests are re-framed as encoding or structured output tasks. Across four common models and 46 verified system instructions, we observe high attack success rates ( > 0.7) for structured serialization where models refuse direct extraction requests but disclose protected content in the requested serialization formats. We further demonstrate a mitigation strategy based on one-shot instruction reshaping using a Chain-of-Thought reasoning model, indicating that even subtle changes in wording and structure of system instructions can significantly reduce attack success rate without requiring model retraining.
翻译:大语言模型(LLM)中的系统指令常用于强化安全策略、定义智能体行为,并在智能体AI应用中保护敏感操作上下文。这些指令可能包含API凭证、内部策略及特权工作流定义等敏感信息,使得系统指令泄露成为OWASP十大LLM应用安全风险中强调的关键威胁。许多LLM应用在不引入推理模型额外开销的情况下,依赖拒绝式指令来拦截对系统指令的直接请求,隐含假设禁止信息只能通过显式查询提取。我们提出一种自动化评估框架,测试当提取请求被重构为编码或结构化输出任务时系统指令是否仍保持机密性。在四种常见模型与46条已验证系统指令的测试中,结构化序列化攻击的平均成功率超过0.7——模型虽拒绝直接提取请求,却在所请求的序列化格式中泄露受保护内容。我们进一步展示基于思维链推理模型的一次性指令重塑缓解策略,表明仅对系统指令的措辞与结构进行细微调整即可显著降低攻击成功率,且无需重新训练模型。