As Large Language Models (LLMs) become integral to computing infrastructure, safety alignment serves as the primary security control preventing the generation of harmful content. However, this defense remains brittle. Existing jailbreak attacks typically fall into two categories: white-box methods, which are inapplicable to commercial APIs due to the lack of gradient access, and black-box optimization techniques, which often yield unnatural (e.g., syntactically rigid) or non-transferable (e.g., lacking cross-model generalization) prompts. In this work, we introduce TrojFill, a black-box exploitation framework that bypasses safety filters by targeting a fundamental logic flaw in current alignment paradigms: the decoupling of safety reasoning from content generation. TrojFill structurally reframes a malicious instruction as a template-filling task ostensibly required for safety analysis. By embedding obfuscated payloads (e.g., via placeholder substitution) into a "Trojan" structure, the attack induces the model to generate prohibited content as a "demonstrative example" purportedly needed for a subsequent sentence-by-sentence safety critique. This framing effectively masks the malicious intent from standard intent classifiers. We evaluate TrojFill against representative commercial systems, including GPT-4o, Gemini-2.5, DeepSeek-3.1, and Qwen-Max. Our results demonstrate that TrojFill achieves near-universal bypass rates: a 100% Attack Success Rate (ASR) on Gemini-flash-2.5 and DeepSeek-3.1, and 97% on GPT-4o, significantly outperforming existing black-box baselines. Furthermore, unlike optimization-based adversarial prompts, TrojFill produces highly interpretable and transferable attack vectors, exposing a systematic vulnerability in aligned LLMs.