Understanding the Effectiveness of Coverage Criteria for Large Language Models: A Special Angle from Jailbreak Attacks

Large language models (LLMs) have revolutionized artificial intelligence, but their increasing deployment across critical domains has raised concerns about their abnormal behaviors when faced with malicious attacks. Such vulnerabilities point to the widespread inadequacy of pre-release testing. In this paper, we conduct a comprehensive empirical study to evaluate how effectively traditional coverage criteria identify such inadequacies, using jailbreak attacks as a representative and significant security concern. Our study begins with a clustering analysis of the hidden states of LLMs, which reveals that the embedded characteristics effectively distinguish between different query types. We then systematically evaluate the performance of these criteria across three key dimensions: criterion level, layer level, and token level. Our research uncovers significant differences in neuron coverage when LLMs process normal versus jailbreak queries, consistent with our clustering experiments. Leveraging these findings, we propose three practical applications of coverage criteria in the context of LLM security testing. First, we develop a real-time jailbreak detection mechanism that achieves high accuracy (93.61% on average) in classifying queries as normal or jailbreak. Second, we explore the use of coverage levels to prioritize test cases, improving testing efficiency by focusing on high-risk interactions and removing redundant tests. Third, we introduce a coverage-guided approach for generating jailbreak attack examples, enabling systematic refinement of prompts to uncover vulnerabilities. This study improves our understanding of LLM security testing, helps enhance the safety of LLMs, and provides a foundation for developing more robust AI applications.
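To make the notion of neuron coverage concrete, the following is a minimal sketch of the classic definition: the fraction of neurons whose activation exceeds a threshold for at least one input in a test set. It is an illustration under stated assumptions, not the procedure used in this study. The function name `neuron_coverage`, the threshold of 0.5, and the toy activation values are all hypothetical; in practice the activations would be taken from the hidden states of an LLM.

```python
from typing import Sequence


def neuron_coverage(
    activations: Sequence[Sequence[float]], threshold: float = 0.0
) -> float:
    """Fraction of neurons activated above `threshold` by at least one input.

    `activations` holds one row per input and one column per neuron.
    This is the textbook neuron coverage criterion; the criteria and
    thresholds evaluated in the paper may differ (assumption).
    """
    if not activations:
        return 0.0
    n_neurons = len(activations[0])
    covered = [False] * n_neurons
    for row in activations:
        for j, value in enumerate(row):
            if value > threshold:
                covered[j] = True
    return sum(covered) / n_neurons


# Toy, made-up activations for two query sets (4 neurons, 2 inputs each).
normal_acts = [[0.9, 0.1, 0.0, 0.2], [0.8, 0.0, 0.1, 0.3]]
jailbreak_acts = [[0.9, 0.7, 0.0, 0.6], [0.1, 0.8, 0.9, 0.2]]

print(neuron_coverage(normal_acts, threshold=0.5))
print(neuron_coverage(jailbreak_acts, threshold=0.5))
```

With these toy values, the second set activates more neurons above the threshold than the first. A gap of this kind between normal and jailbreak queries is the sort of signal that a coverage-based detector or test prioritizer could build on.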
