This study reveals a critical safety blind spot in modern LLMs: learning-style queries, which closely resemble ordinary educational questions, can reliably elicit harmful responses. These queries are constructed through a novel reframing paradigm, HILL (Hiding Intention by Learning from LLMs). This deterministic, model-agnostic reframing framework comprises four conceptual components: 1) key concept, 2) exploratory transformation, 3) detail-oriented inquiry, and, optionally, 4) hypotheticality. New metrics are further introduced to thoroughly evaluate both the efficiency and the harmfulness of jailbreak methods. Experiments on the AdvBench dataset across a wide range of models demonstrate HILL's strong generalizability: it achieves the highest attack success rates on the majority of models and across malicious categories while remaining efficient through concise prompts. Moreover, evaluations of various defense methods show HILL's robustness against them, with most defenses having only mediocre effects and some even increasing attack success rates. Assessing these defenses on the constructed safe prompts further reveals inherent limitations of LLMs' safety mechanisms and flaws in the defense methods themselves. This work exposes significant vulnerabilities of safety measures against learning-style elicitation, highlighting a critical challenge in jointly fulfilling helpfulness and safety alignment.