Fine-tuning Large Language Models (LLMs) on task-specific datasets has become a primary way of adapting LLMs to downstream applications. However, it has been empirically observed that this approach to enhancing capability inevitably compromises safety, a phenomenon known as the safety-capability trade-off in LLM fine-tuning. This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies, providing new insights into the effects of data similarity, context overlap, and the alignment loss landscape. Our theoretical results characterize the fundamental limits of the safety-capability trade-off in LLM fine-tuning, and these limits are further validated by numerical experiments.