Large Language Models~(LLMs) have gained immense popularity and are being increasingly applied in various domains. Consequently, ensuring the security of these models is of paramount importance. Jailbreak attacks, which manipulate LLMs to generate malicious content, are recognized as a significant vulnerability. While existing research has predominantly focused on direct jailbreak attacks on LLMs, there has been limited exploration of indirect methods. The integration of various plugins into LLMs, notably Retrieval Augmented Generation~(RAG), which enables LLMs to incorporate external knowledge bases into their response generation such as GPTs, introduces new avenues for indirect jailbreak attacks. To fill this gap, we investigate indirect jailbreak attacks on LLMs, particularly GPTs, introducing a novel attack vector named Retrieval Augmented Generation Poisoning. This method, Pandora, exploits the synergy between LLMs and RAG through prompt manipulation to generate unexpected responses. Pandora uses maliciously crafted content to influence the RAG process, effectively initiating jailbreak attacks. Our preliminary tests show that Pandora successfully conducts jailbreak attacks in four different scenarios, achieving higher success rates than direct attacks, with 64.3\% for GPT-3.5 and 34.8\% for GPT-4.
翻译:大型语言模型(LLMs)已获得广泛普及,并日益应用于各个领域。因此,确保这些模型的安全性至关重要。越狱攻击通过操纵LLMs生成恶意内容,被视为一项重大安全漏洞。现有研究主要聚焦于对LLMs的直接越狱攻击,而对间接方法的探索有限。LLMs集成多种插件(尤其是检索增强生成,即RAG),使其能够将外部知识库(如GPTs)融入响应生成过程,这为间接越狱攻击开辟了新途径。为填补这一空白,我们研究了对LLMs(尤其是GPTs)的间接越狱攻击,提出了一种名为检索增强生成投毒的新型攻击向量。该方法(潘多拉)通过提示操控利用LLMs与RAG之间的协同效应,生成意外响应。潘多拉使用恶意构造的内容影响RAG过程,有效发起越狱攻击。初步测试表明,潘多拉在四种不同场景中成功实施越狱攻击,成功率高于直接攻击,对GPT-3.5达到64.3%,对GPT-4达到34.8%。