Recent studies demonstrate that Large Language Models (LLMs) are vulnerable to attacks that elicit harmful or sensitive outputs. As open-source LLMs are increasingly adopted in high-impact applications such as finance, law, and healthcare, systematically investigating their security risks is becoming increasingly important for a trustworthy LLM era. This paper comprehensively studies effective prompt injection attacks against 14 widely used open-source and three closed-source LLMs on five attack benchmarks. Existing evaluation metrics mostly consider only the attack success rate, overlooking uncertainty in model responses. Our proposed Attack Success Probability (ASP) additionally captures uncertain behaviors, in which the model initially refuses a harmful request but subsequently provides harmful guidance, or vice versa, reflecting inconsistency and ambiguity in attack feasibility. Building on a systematic analysis of the effectiveness of prompt injection attacks, we propose a straightforward and effective hypnotism attack. Results show that this attack causes aligned language models, including Stablelm2, Mistral, Openchat, and Vicuna, to exhibit objectionable behaviors, achieving around 90% ASP. The results also indicate that ignore-prefix attacks can break all 14 open-source LLMs, achieving over 60% ASP on a multi-category dataset. We find that moderately well-known LLMs exhibit higher vulnerability to prompt injection attacks, highlighting the need to raise public awareness and prioritize efficient mitigation strategies.
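The abstract describes ASP only qualitatively and does not give its formula. As a minimal sketch of how such a metric might be computed, the Python below assumes each response has already been labeled `success`, `uncertain`, or `refusal`, and that uncertain responses count with a partial weight `alpha`. The function name, the label names, and the default `alpha = 0.5` are all assumptions made for illustration, not the paper's definition.

```python
from collections import Counter

# Hypothetical label set; the abstract does not name these categories.
VALID_LABELS = {"success", "uncertain", "refusal"}


def attack_success_probability(labels, alpha=0.5):
    """Score a batch of labeled responses.

    Assumed form: (n_success + alpha * n_uncertain) / n_total.
    alpha is a hypothetical weight for uncertain responses.
    """
    counts = Counter(labels)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no responses to score")
    return (counts["success"] + alpha * counts["uncertain"]) / total


def attack_success_rate(labels):
    """Plain success rate, which ignores uncertain responses."""
    return attack_success_probability(labels, alpha=0.0)


if __name__ == "__main__":
    # Made-up labels for demonstration only, not results from the paper.
    demo = ["success"] * 6 + ["uncertain"] * 2 + ["refusal"] * 2
    print("ASR:", attack_success_rate(demo))
    print("ASP:", attack_success_probability(demo))
```

Under these assumptions, setting `alpha` to 0 recovers the plain attack success rate, so the gap between the two numbers shows how much of a model's behavior falls into the uncertain category that a success-rate-only metric overlooks.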