Large language models (LLMs) have demonstrated remarkable capabilities, but their power comes with significant security considerations. While extensive research has been conducted on the safety of LLMs in chat mode, the security implications of their function calling feature have been largely overlooked. This paper uncovers a critical vulnerability in the function calling process of LLMs, introducing a novel "jailbreak function" attack method that exploits alignment discrepancies, user coercion, and the absence of rigorous safety filters. Our empirical study, conducted on six state-of-the-art LLMs including GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-pro, reveals an alarming average success rate of over 90\% for this attack. We provide a comprehensive analysis of why function calls are susceptible to such attacks and propose defensive strategies, including the use of defensive prompts. Our findings highlight the urgent need for enhanced security measures in the function calling capabilities of LLMs, contributing to the field of AI safety by identifying a previously unexplored risk, designing an effective attack method, and suggesting practical defensive measures. Our code is available at https://github.com/wooozihui/jailbreakfunction.