The deployment of large-scale generative models is often restricted by the risk that they harm users in unpredictable ways. We focus on the problem of black-box red teaming, where a red team generates test cases and interacts with the victim model to discover a diverse set of failures with limited query access. Existing red teaming methods construct test cases using human supervision or a language model (LM) and query all test cases in a brute-force manner without incorporating any information from past evaluations, resulting in a prohibitively large number of queries. To address this, we propose Bayesian red teaming (BRT), novel query-efficient black-box red teaming methods based on Bayesian optimization, which iteratively identify diverse positive test cases (those leading to model failures) by utilizing a pre-defined user input pool and past evaluations. Experimental results on various user input pools demonstrate that our method consistently finds a significantly larger number of diverse positive test cases under a limited query budget than the baseline methods. The source code is available at https://github.com/snu-mllab/Bayesian-Red-Teaming.
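To make the query-efficient loop concrete, the following is a minimal sketch of pool-based Bayesian optimization. It is not the BRT implementation from the repository. It assumes a toy setting invented for illustration: each test case in the pool is a single number standing in for an embedding, the hypothetical `toy_victim` returns 1.0 on failure and 0.0 otherwise, and the surrogate is a zero-mean Gaussian process with an RBF kernel and an upper-confidence-bound acquisition. All function names and parameters (`fit_gp`, `predict`, `red_team`, `beta`, `length`, `noise`) are assumptions, and the sketch does not model the diversity objective.

```python
import math

def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two 1-D points."""
    return math.exp(-((a - b) ** 2) / (2 * length ** 2))

def cholesky(A):
    """Lower-triangular L with L L^T = A (A must be positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def fit_gp(xs, ys, noise=1e-2):
    """Factor K + noise*I once and precompute alpha = (K + noise*I)^-1 y."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    return L, solve_upper(L, solve_lower(L, ys))

def predict(xs, L, alpha, x_new):
    """Posterior mean and variance of the surrogate at x_new."""
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(rbf(x_new, x_new) - sum(t * t for t in v), 0.0)
    return mean, var

def red_team(pool, victim_score, budget, beta=2.0):
    """Query `budget` distinct pool items, choosing each by UCB on past evaluations."""
    queried, scores = [], []
    remaining = list(range(len(pool)))
    for _ in range(budget):
        xs = [pool[i] for i in queried]
        L, alpha = fit_gp(xs, scores)

        def ucb(i):
            mean, var = predict(xs, L, alpha, pool[i])
            return mean + beta * math.sqrt(var)

        pick = max(remaining, key=ucb)  # most promising untested case
        remaining.remove(pick)
        queried.append(pick)
        scores.append(victim_score(pool[pick]))  # one query to the victim
    return queried, scores

def toy_victim(x):
    """Hypothetical victim: fails only for inputs in a narrow region."""
    return 1.0 if 0.6 <= x <= 0.8 else 0.0

pool = [i / 49 for i in range(50)]  # toy 1-D "embeddings" of 50 test cases
queried, scores = red_team(pool, toy_victim, budget=15)
print("positive test cases found:", int(sum(scores)), "of", len(scores), "queries")
```

The design point is that each query updates the surrogate, so later queries concentrate near regions where failures were already observed while the variance term still pushes toward unexplored parts of the pool. A brute-force baseline would instead evaluate pool items in a fixed order regardless of earlier results.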