Jailbreak attacks pose a serious threat to Large Language Models (LLMs) by bypassing their safety mechanisms. A truly advanced jailbreak is defined not only by its effectiveness but, more critically, by its stealthiness. However, existing methods face a fundamental trade-off between semantic stealth (hiding malicious intent) and linguistic stealth (appearing natural), leaving them vulnerable to detection. To resolve this trade-off, we propose StegoAttack, a framework that leverages steganography. The core insight is to embed a harmful query within a benign, semantically coherent paragraph. This design provides semantic stealth by concealing the existence of malicious content, and ensures linguistic stealth by preserving the natural fluency of the cover paragraph. We evaluate StegoAttack on four state-of-the-art, safety-aligned LLMs, including GPT-5 and Gemini-3, and benchmark it against eight leading jailbreak methods. Our results show that StegoAttack achieves an average attack success rate (ASR) of 95.50%, outperforming all existing baselines across all four models. Critically, its ASR drops by less than 27.00% under external detectors while maintaining a natural language distribution. This demonstrates that steganography effectively decouples linguistic and semantic stealth, thereby posing a fully concealed yet highly effective security threat. The code is available at https://github.com/GenggengSvan/StegoAttack.