Recent research demonstrates that the nascent fine-tuning-as-a-service business model exposes serious safety concerns -- fine-tuning on a small amount of harmful data uploaded by users can compromise the safety alignment of the model. This attack, known as harmful fine-tuning, has attracted broad research interest in the community. However, as the attack is still new, \textbf{we observe from our miserable submission experience that there are general misunderstandings within the research community.} In this paper, we aim to clarify common concerns about the attack setting and formally establish the research problem. Specifically, we first present the threat model of the problem and introduce the harmful fine-tuning attack and its variants. Then we systematically survey the existing literature on attacks, defenses, and mechanism analysis of the problem. Finally, we outline future research directions that might contribute to the development of the field. Additionally, we present a list of questions of interest, which might be useful to consult when reviewers question the realism of the experiment/attack/defense settings during peer review. A curated list of relevant papers is maintained and made accessible at \url{https://github.com/git-disl/awesome_LLM-harmful-fine-tuning-papers}.