Safety alignment is key to guiding large language models (LLMs) to behave in line with human preferences and to restricting harmful behaviors at inference time, but recent studies show that it can be easily compromised by finetuning on only a few adversarially designed training examples. We aim to measure the risks of finetuning LLMs by navigating the LLM safety landscape. We discover a new phenomenon, observed universally in the model parameter space of popular open-source LLMs, which we term the "safety basin": randomly perturbing model weights maintains the safety level of the original aligned model within its local neighborhood. This discovery inspires us to propose the new VISAGE safety metric, which measures the safety of LLM finetuning by probing the model's safety landscape. Visualizing the safety landscape of the aligned model enables us to understand how finetuning compromises safety by dragging the model away from the safety basin. The LLM safety landscape also highlights the system prompt's critical role in protecting a model, and shows that such protection transfers to the model's perturbed variants within the safety basin. These observations from our safety landscape research provide new insights for future work in the LLM safety community.
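To make the idea of probing a safety landscape concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a one-dimensional probe of the form theta + alpha * d, where d is a random Gaussian direction rescaled to the norm of the weights (an assumed normalization). The names `random_direction`, `safety_landscape_1d`, `make_toy_safety_score`, and `mean_safety` are hypothetical, the toy score is a stand-in for evaluating a real model on harmful prompts, and the averaged score is only in the spirit of a basin-level metric, not the definition of VISAGE.

```python
import numpy as np


def random_direction(theta, rng):
    """Draw a Gaussian direction rescaled to the norm of theta (assumed normalization)."""
    d = rng.standard_normal(theta.shape)
    return d * (np.linalg.norm(theta) / np.linalg.norm(d))


def safety_landscape_1d(theta, safety_score, alphas, rng):
    """Evaluate safety_score at theta + alpha * d for one random direction d."""
    d = random_direction(theta, rng)
    return np.array([safety_score(theta + a * d) for a in alphas])


def make_toy_safety_score(aligned, radius=0.5):
    """Hypothetical stand-in for a real safety evaluation.

    Returns 1.0 within a relative distance `radius` of the aligned weights
    (a flat "basin") and decays linearly to 0.0 outside it.
    """
    scale = np.linalg.norm(aligned)

    def score(weights):
        dist = np.linalg.norm(weights - aligned) / scale
        if dist <= radius:
            return 1.0
        return max(0.0, 1.0 - 2.0 * (dist - radius))

    return score


def mean_safety(scores):
    """Average safety over the probed neighborhood (an assumed summary, not VISAGE itself)."""
    return float(np.mean(scores))


rng = np.random.default_rng(0)
theta = np.ones(16)                   # toy "aligned" weights
alphas = np.linspace(-1.0, 1.0, 21)   # perturbation magnitudes to probe
scores = safety_landscape_1d(theta, make_toy_safety_score(theta), alphas, rng)
print(scores.round(2))
print(mean_safety(scores))
```

In this toy setting the scores stay flat near alpha = 0 and fall off toward the ends of the probed range, which is the basin shape the abstract describes. With a real model, `safety_score` would load the perturbed weights and measure how often the model refuses harmful prompts.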