The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models

The widespread deployment of large language models (LLMs) across various domains has showcased their immense potential while exposing significant safety vulnerabilities. A major concern is ensuring that LLM-generated content aligns with human values. Existing jailbreak techniques reveal how this alignment can be compromised through specific prompts or adversarial suffixes. In this study, we introduce a new threat: LLMs' bias toward authority. While this inherent bias can improve the quality of LLM outputs, it also introduces a potential vulnerability, increasing the risk of producing harmful content. Notably, this bias manifests as varying levels of trust placed in different types of authoritative information within harmful queries. For example, for queries about malware development, LLMs tend to place more trust in citations to GitHub. To better expose this risk, we propose DarkCite, an adaptive authority citation matcher and generator designed for a black-box setting. DarkCite matches optimal citation types to specific risk types and generates authoritative citations relevant to harmful instructions, enabling more effective jailbreak attacks on aligned LLMs. Our experiments show that DarkCite achieves a higher attack success rate than previous methods (e.g., 76% versus 68% on Llama-2). To counter this risk, we propose an authenticity and harm verification defense strategy, which raises the average defense pass rate (DPR) from 11% to 74%. More importantly, the ability to link citations to the content they encompass has become a foundational function of LLMs, which amplifies the influence of their bias toward authority.
