We propose a new approach to evaluating the effectiveness of jailbreak attacks on Large Language Models (LLMs) such as GPT-4 and LLaMa2, departing from traditional binary evaluations that focus on robustness. We introduce two evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation. Each framework scores attacks on a scale from 0 to 1 and offers a distinct perspective, enabling a more comprehensive and nuanced assessment of attack effectiveness and helping attackers refine their attack prompts. We have also developed a comprehensive ground-truth dataset tailored to jailbreak tasks. This dataset serves as a benchmark for the current study and as a foundational resource for future research, enabling consistent, comparable analyses in this evolving field. Compared with traditional evaluation methods, our evaluation follows the same trend as the baseline while providing a more detailed assessment. We believe that accurately evaluating the effectiveness of attack prompts in the jailbreak task lays a solid foundation for assessing a wider array of similar or more complex tasks in prompt injection, and could substantially advance this field.