As artificial intelligence (AI) systems are increasingly deployed across critical domains, their security vulnerabilities pose growing risks of high-profile exploits and consequential system failures. Yet systematic approaches to evaluating AI security remain underdeveloped. In this paper, we introduce AVISE (AI Vulnerability Identification and Security Evaluation), a modular open-source framework for identifying vulnerabilities in AI systems and models and evaluating their security. To demonstrate the framework, we extend the theory-of-mind-based multi-turn Red Queen attack into an Adversarial Language Model (ALM) augmented attack and develop an automated Security Evaluation Test (SET) for discovering jailbreak vulnerabilities in language models. The SET comprises 25 test cases and an Evaluation Language Model (ELM) that determines whether each test case jailbroke the target model; the ELM's judgments achieve 92% accuracy, an F1-score of 0.91, and a Matthews correlation coefficient of 0.83. We evaluate nine recently released language models of diverse sizes with the SET and find that all are vulnerable to the augmented Red Queen attack to varying degrees. AVISE provides researchers and industry practitioners with an extensible foundation for developing and deploying automated SETs, offering a concrete step toward more rigorous and reproducible AI security evaluation.