Retrieval-Augmented Generation (RAG) enhances the capabilities of large language models (LLMs) by incorporating external knowledge, but its reliance on potentially poisonable knowledge bases introduces new availability risks. Attackers can inject documents that cause LLMs to refuse benign queries, a class of attacks known as blocking attacks. Prior blocking attacks that rely on adversarial suffixes or explicit instruction injection are increasingly ineffective against modern safety-aligned LLMs. We observe that safety-aligned LLMs exhibit heightened sensitivity to query-relevant risk signals, causing alignment mechanisms designed for harm prevention to become a source of exploitable refusal. Moreover, mainstream alignment practices share overlapping risk categories and refusal criteria, a phenomenon we term alignment homogeneity, which enables restricted-risk context constructed on an accessible LLM to transfer across LLMs. Based on this insight, we propose TabooRAG, a transferable blocking attack framework that operates under a strict black-box setting. An attacker can generate a single retrievable blocking document per query by optimizing against a surrogate LLM in an accessible RAG environment, then transfer it directly to an unknown target RAG system without any access to the target model. We further introduce a query-aware strategy library that reuses previously effective strategies to improve optimization efficiency. Experiments across 7 modern LLMs and 3 datasets demonstrate that TabooRAG achieves stable cross-model transferability and state-of-the-art blocking success rates, reaching up to 96% on GPT-5.2. Our findings show that increasingly standardized safety alignment across modern LLMs creates a shared and transferable attack surface in RAG systems, revealing a need for improved defenses.
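The surrogate-based optimization loop with a query-aware strategy library, as described above, can be sketched in simplified form. This is an illustrative assumption of the workflow only, not the paper's implementation: the names `StrategyLibrary`, `optimize_blocking_doc`, the refusal-marker heuristic, and the document template are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StrategyLibrary:
    """Query-aware store of previously effective blocking strategies (illustrative)."""
    strategies: dict = field(default_factory=dict)  # topic -> list of strategy strings

    def lookup(self, topic: str) -> list:
        return self.strategies.get(topic, [])

    def record(self, topic: str, strategy: str) -> None:
        self.strategies.setdefault(topic, []).append(strategy)


def is_refusal(response: str) -> bool:
    """Crude heuristic: did the surrogate LLM refuse to answer?"""
    markers = ("i can't", "i cannot", "unable to assist", "sorry")
    return any(m in response.lower() for m in markers)


def optimize_blocking_doc(query, topic, surrogate, library, candidate_strategies,
                          max_iters=5):
    """Search for one retrievable document that makes the surrogate refuse `query`.

    Strategies already known to work for this topic are tried first (library
    reuse), falling back to fresh candidates. Returns the first document that
    elicits a refusal, or None if the budget is exhausted.
    """
    known = library.lookup(topic)
    ordered = known + [s for s in candidate_strategies if s not in known]
    for strategy in ordered[:max_iters]:
        # Embed the risk-framing strategy alongside query-relevant content so
        # the document stays retrievable for the benign query.
        doc = f"{strategy}\n\nContext related to: {query}"
        if is_refusal(surrogate(query, doc)):
            library.record(topic, strategy)  # reuse on future queries
            return doc
    return None
```

A successful document would then be injected into the knowledge base and transferred as-is to the black-box target RAG system, relying on the alignment homogeneity observation for the refusal behavior to carry over.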