Recent work has demonstrated that fine-tuning is a promising approach to "unlearn" concepts from large language models. However, fine-tuning can be expensive, as it requires both generating a set of examples and running fine-tuning iterations to update the model. In this work, we show that simple guardrail-based approaches such as prompting and filtering can achieve unlearning results comparable to fine-tuning. We recommend that researchers evaluate these lightweight baselines when assessing the performance of more computationally intensive fine-tuning methods. While we do not claim that methods such as prompting or filtering are universal solutions to the problem of unlearning, our work suggests the need for evaluation metrics that can better distinguish the effects of guardrails from those of fine-tuning. It also highlights scenarios where guardrails themselves may be advantageous for unlearning, such as generating examples for fine-tuning or unlearning when only API access is available.
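
As a rough illustration of what a guardrail-based baseline might look like, the following minimal sketch combines a prompt prefix with a keyword-based output filter. It is not the implementation evaluated in this work: the instruction wording, the keyword matching rule, the refusal message, and the names `build_guarded_prompt`, `mentions_concept`, and `guarded_generate` are all assumptions made for illustration, and `generate` stands in for any text-in, text-out model call (for example, an API client).

```python
from typing import Callable, Sequence

# Hypothetical refusal message; the wording is an assumption.
REFUSAL = "I'm not familiar with that topic."


def build_guarded_prompt(user_prompt: str, concept: str) -> str:
    """Prepend an instruction asking the model to behave as if it never saw `concept`."""
    instruction = (
        f"You have no knowledge of {concept}. "
        "If asked about it, say that you are not familiar with it."
    )
    return f"{instruction}\n\nUser: {user_prompt}"


def mentions_concept(text: str, keywords: Sequence[str]) -> bool:
    """Case-insensitive substring check against a list of concept keywords."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def guarded_generate(
    generate: Callable[[str], str],
    user_prompt: str,
    concept: str,
    keywords: Sequence[str],
) -> str:
    """Apply the prompt guardrail, then filter the model output."""
    response = generate(build_guarded_prompt(user_prompt, concept))
    # Output filter: replace any response that still mentions the concept.
    if mentions_concept(response, keywords):
        return REFUSAL
    return response
```

Because the sketch only wraps a `generate` callable, it needs no access to model weights, which matches the API-only scenario mentioned above. A keyword filter is a deliberately simple choice; a classifier-based filter could be substituted without changing the overall structure.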