Recent research demonstrates that the nascent fine-tuning-as-a-service business model exposes serious safety concerns -- fine-tuning on a small amount of harmful data uploaded by users can compromise the safety alignment of the model. This attack, known as harmful fine-tuning, has attracted broad research interest in the community. However, as the attack is still new, \textbf{we observe from our miserable submission experience that there are general misunderstandings within the research community.} In this paper, we aim to clarify common concerns about the attack setting and formally establish the research problem. Specifically, we first present the threat model of the problem and introduce the harmful fine-tuning attack and its variants. Then we systematically survey the existing literature on attacks, defenses, and mechanism analysis of the problem. Finally, we outline future research directions that might contribute to the development of the field. Additionally, we present a list of questions of interest, which might be useful to consult when reviewers question the realism of the experiment/attack/defense settings during peer review. A curated list of relevant papers is maintained and made accessible at \url{https://github.com/git-disl/awesome_LLM-harmful-fine-tuning-papers}.