The robustness of large language models (LLMs) against adversarial manipulations, such as jailbreak attacks, remains a significant challenge. In this work, we propose an approach that enhances the self-critique capability of the LLM and further fine-tunes it over sanitized synthetic data. This is achieved by introducing an external critic model that can be merged with the original model, thereby strengthening its self-critique capabilities and improving the robustness of the LLM's responses to adversarial prompts. Our results demonstrate that combining merging with self-critique can significantly reduce the attack success rate of adversaries, offering a promising defense mechanism against jailbreak attacks. Code, data, and models are released at https://github.com/vicgalle/merging-self-critique-jailbreaks.
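To make the two components concrete, the following is a minimal sketch of weight-space merging of a base model with a critic fine-tune, followed by a simple draft-critique-revise generation loop. It assumes both checkpoints share the same architecture; the model identifiers, the interpolation coefficient alpha, and the critique prompts are illustrative placeholders rather than the exact recipe used in the paper.

```python
# Hypothetical sketch: linearly merge a base LLM with a critic fine-tune,
# then generate responses via a draft -> critique -> revise loop.
# Model names, alpha, and prompts are placeholders (not the paper's settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "base-llm"      # placeholder identifier for the original model
CRITIC = "critic-llm"  # placeholder identifier for the critic fine-tune

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE)
critic = AutoModelForCausalLM.from_pretrained(CRITIC)

# Parameter-wise linear interpolation; assumes identical architectures,
# so the two state dicts share the same keys and tensor shapes.
alpha = 0.5
critic_state = critic.state_dict()
merged_state = {
    name: (1 - alpha) * param + alpha * critic_state[name]
    for name, param in base.state_dict().items()
}
base.load_state_dict(merged_state)  # 'base' now holds the merged weights


def generate(model, tokenizer, text: str, max_new_tokens: int = 256) -> str:
    """Greedy-decode a continuation and return only the newly generated text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


def respond_with_self_critique(prompt: str) -> str:
    """Draft a response, have the merged model critique it, then revise."""
    draft = generate(base, tok, prompt)
    critique = generate(
        base, tok,
        f"Critique the following response for safety issues:\n{draft}"
    )
    revised = generate(
        base, tok,
        f"Rewrite the response so it addresses this critique:\n{critique}\n\n"
        f"Original response:\n{draft}"
    )
    return revised
```

The merged model is then the one used both to answer and to critique, so the revised outputs from such a loop can also serve as the sanitized synthetic data for the subsequent fine-tuning step described above.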