Recent advances in large language models (LLMs) have produced significant breakthroughs across a wide range of AI applications. However, their sophisticated capabilities also introduce serious safety concerns, particularly the generation of harmful content through jailbreak attacks. Current safety testing for LLMs often relies on static datasets and lacks systematic criteria for evaluating the quality and adequacy of test suites. While coverage criteria have proven effective for smaller neural networks, they are not directly applicable to LLMs due to scalability issues and differing objectives. To address these challenges, this paper introduces RACA, a novel set of coverage criteria specifically designed for LLM safety testing. RACA leverages representation engineering to focus on safety-critical concepts within LLMs, thereby reducing dimensionality and filtering out irrelevant information. The framework operates in three stages: first, it identifies safety-critical representations using a small, expert-curated calibration set of jailbreak prompts; second, it computes conceptual activation scores for a given test suite based on these representations; finally, it derives coverage results using six sub-criteria that assess both individual and compositional safety concepts. We conduct comprehensive experiments to validate RACA's effectiveness, applicability, and generalization. The results demonstrate that RACA successfully identifies high-quality jailbreak prompts and outperforms traditional neuron-level criteria. We also showcase its practical utility in real-world scenarios such as test-set prioritization and attack-prompt sampling. Furthermore, our findings confirm that RACA generalizes across diverse scenarios and remains robust under varying configurations. Overall, RACA provides a new framework for evaluating the safety of LLMs, contributing a valuable technique to the field of AI testing.
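To make the three-stage pipeline concrete, the sketch below traces the flow in Python. It is a minimal illustration under stated assumptions, not the paper's implementation: the difference-of-means direction estimator, the bin-based coverage criterion, and all variable names are hypothetical stand-ins, and random vectors replace real LLM hidden states.

```python
# A minimal sketch of a three-stage pipeline like the one described above.
# All names and the specific estimators are illustrative assumptions, not
# RACA's actual API; synthetic random vectors stand in for LLM activations.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM, N_CALIB, N_TESTS, N_BINS = 64, 32, 200, 10

# Stage 1: derive a safety-critical representation (concept direction)
# from a small calibration set of jailbreak vs. benign activations,
# here via a simple difference-of-means (a common representation-
# engineering baseline; the paper may use a different estimator).
jailbreak_acts = rng.normal(0.5, 1.0, (N_CALIB, HIDDEN_DIM))
benign_acts = rng.normal(0.0, 1.0, (N_CALIB, HIDDEN_DIM))
direction = jailbreak_acts.mean(axis=0) - benign_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# Stage 2: conceptual activation scores for a test suite, taken as the
# projection of each test prompt's activation onto the concept direction.
test_acts = rng.normal(0.2, 1.0, (N_TESTS, HIDDEN_DIM))
scores = test_acts @ direction

# Stage 3: one coverage-style criterion, the fraction of equal-width
# score bins that the test suite activates (analogous in spirit to a
# single sub-criterion; the six actual definitions are in the paper).
counts, _ = np.histogram(scores, bins=N_BINS,
                         range=(scores.min(), scores.max()))
coverage = (counts > 0).mean()
print(f"concept-bin coverage: {coverage:.2f}")
```

A higher coverage value under such a criterion would indicate that the test suite exercises a broader range of the safety-critical concept's activation space, which is the intuition behind using it for test-set prioritization and attack-prompt sampling.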