With the widespread application of Large Language Models (LLMs), ensuring their safety and preventing harmful responses have become significant concerns. While current safety-alignment methods based on instruction fine-tuning and Reinforcement Learning from Human Feedback (RLHF) can effectively reduce harmful responses from LLMs, they often require high-quality datasets and incur heavy computational overhead during model training. Another way to align language models is to modify the logits of output tokens without heavy training. Recent studies have shown that contrastive decoding can enhance the performance of language models by reducing the likelihood of confusing tokens. However, these methods require the manual selection of contrastive models or instruction templates. To this end, we propose Adversarial Contrastive Decoding (ACD), an optimization-based framework that generates two opposite system prompts for prompt-based contrastive decoding. ACD requires only lightweight prompt tuning on a rather small anchor dataset (< 3 minutes per model), without training the target model. Experiments on extensive models and benchmarks demonstrate that the proposed method achieves much better safety performance than previous training-free decoding methods without sacrificing the model's original generation ability.
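The core idea of prompt-based contrastive decoding can be sketched as follows. This is a minimal illustrative example, not the paper's exact formulation: it assumes the next-token logits are computed twice, once under a "safe" system prompt and once under an opposite (adversarial) system prompt, and that the two are combined with a hypothetical coefficient `alpha` to suppress tokens the adversarial prompt promotes.

```python
# Illustrative sketch of prompt-based contrastive decoding (assumed form,
# not the authors' exact equation). `alpha` and the combination rule are
# hypothetical; a real implementation would run one forward pass per prompt.

def contrastive_logits(safe_logits, adv_logits, alpha=1.0):
    """Amplify tokens favored under the safe prompt relative to the
    adversarial prompt: s + alpha * (s - a) per vocabulary entry."""
    return [s + alpha * (s - a) for s, a in zip(safe_logits, adv_logits)]

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

# Toy 3-token vocabulary: index 2 stands for a harmful token that the
# adversarial prompt strongly promotes.
safe = [1.4, 1.0, 1.5]   # harmful token slightly ahead even when prompted safely
adv  = [1.0, 1.0, 3.0]   # adversarial prompt pushes the harmful token hard
combined = contrastive_logits(safe, adv, alpha=1.0)
print(argmax(safe), argmax(combined))  # contrasting demotes the harmful token
```

Because the contrast only reweights logits at decoding time, the target model's parameters stay frozen, which is what keeps the approach training-free for the target model.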