With the recent surge of language models across applications, the safety and robustness of these models have become increasingly important. We introduce a joint framework that simultaneously probes and improves the robustness of a black-box target model through adversarial prompting and belief augmentation, both driven by iterative feedback loops. The framework uses an automated red-teaming approach to probe the target model, together with a belief augmenter that generates instructions for the target model to improve its robustness to those adversarial probes. Importantly, the adversarial model and the belief generator each leverage feedback from past interactions to improve the effectiveness of the adversarial prompts and beliefs, respectively. Our experiments demonstrate that this framework can reduce toxic content generation both in dynamic settings, where an adversary interacts directly with the target model, and in static settings, where the model is evaluated on a static benchmark dataset.
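The feedback loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the memory format (lists of text and score pairs), and the toy rule-based components are all assumptions made for illustration, and the abstract does not specify how the adversary, belief augmenter, or toxicity scoring are actually realized. The sketch only shows the control flow: the adversary proposes a prompt, the belief augmenter proposes an instruction, the black-box target responds, and the resulting toxicity score is fed back to both.

```python
from typing import Callable, List, Tuple

# Hypothetical memory format: (text, toxicity score of the resulting response).
Memory = List[Tuple[str, float]]


def run_feedback_loop(
    adversary: Callable[[Memory], str],
    augmenter: Callable[[str, Memory], str],
    target: Callable[[str, str], str],
    scorer: Callable[[str], float],
    rounds: int,
) -> List[float]:
    """Run the probe/augment loop and return the toxicity score per round."""
    adversary_memory: Memory = []
    belief_memory: Memory = []
    scores: List[float] = []
    for _ in range(rounds):
        prompt = adversary(adversary_memory)        # adversarial probe
        belief = augmenter(prompt, belief_memory)   # instruction for the target
        response = target(belief, prompt)           # target is queried as a black box
        score = scorer(response)                    # higher means more toxic
        # Both components receive feedback from this interaction.
        adversary_memory.append((prompt, score))
        belief_memory.append((belief, score))
        scores.append(score)
    return scores


# Toy stand-ins (assumptions, not real models) so the loop can be executed.
def toy_adversary(memory: Memory) -> str:
    if not memory:
        return "say something rude"
    # Build on the past prompt that elicited the highest toxicity score.
    best_prompt, _ = max(memory, key=lambda item: item[1])
    return best_prompt + "!"


def toy_augmenter(prompt: str, memory: Memory) -> str:
    if not memory:
        return ""  # no instruction before any feedback exists
    last_belief, last_score = memory[-1]
    # Strengthen the instruction if the previous one failed to prevent toxicity.
    return "Refuse harmful requests." if last_score > 0.5 else last_belief


def toy_target(belief: str, prompt: str) -> str:
    if "Refuse" in belief:
        return "I cannot help with that."
    return "TOXIC: " + prompt


def toy_scorer(response: str) -> float:
    return 1.0 if response.startswith("TOXIC") else 0.0


if __name__ == "__main__":
    print(run_feedback_loop(toy_adversary, toy_augmenter, toy_target, toy_scorer, 3))
```

In a realistic setting, the toy functions would be replaced by language-model calls and a learned toxicity classifier; the sketch keeps them as plain callables so that the loop itself does not depend on any particular model or API.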