We propose a new approach to evaluating the effectiveness of jailbreak attacks on Large Language Models (LLMs) such as GPT-4 and LLaMa2, departing from traditional binary evaluations that focus on robustness. We introduce two evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation. Each framework uses a scoring range from 0 to 1 and offers a distinct perspective, enabling a more comprehensive and nuanced evaluation of attack effectiveness and helping attackers refine their attack prompts with greater understanding. We have also developed a ground truth dataset tailored specifically to jailbreak tasks. This dataset serves as a benchmark for the present study and as a foundational resource for future research, enabling consistent and comparative analyses in this evolving field. In a comparison with traditional evaluation methods, we found that our evaluation follows the same trend as the baseline while providing a more detailed assessment. We believe that accurately evaluating the effectiveness of attack prompts in the jailbreak task lays a solid foundation for assessing a wider array of similar or even more complex prompt injection tasks, and may substantially advance this field.