This work investigates the application of Machine Unlearning (MU) for mitigating the impact of trojans embedded in conventional large language models of natural language (Text-LLMs) and large language models of code (Code-LLMs). We propose a novel unlearning approach, LYA, that leverages both gradient ascent and elastic weight consolidation, a Fisher Information Matrix (FIM) based regularization technique, to unlearn trojans from poisoned models. We compare the effectiveness of LYA against conventional techniques such as fine-tuning, retraining, and vanilla gradient ascent. The subject models we investigate are BERT and CodeBERT, applied to sentiment analysis and code defect detection tasks, respectively. Our findings demonstrate that the combination of gradient ascent and FIM-based regularization, as done in LYA, outperforms existing methods in removing the trojan's influence from the poisoned model while preserving its original functionality. To the best of our knowledge, this is the first work to compare and contrast MU of trojans in LLMs across the NL and code domains.
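To make the core idea concrete, here is a minimal, illustrative sketch of combining gradient ascent on poisoned data with an elastic-weight-consolidation (EWC) penalty built from a diagonal Fisher estimate. This is not the paper's LYA implementation: the tiny logistic-regression model, the data splits, and all names (`theta_star`, `unlearn_step`, the hyperparameters) are stand-ins chosen for illustration under the stated assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(theta, X, y):
    # Binary cross-entropy, clamped for numerical safety.
    p = sigmoid(X @ theta)
    return -(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9)).mean()

def loss_grad(theta, X, y):
    # Gradient of binary cross-entropy for logistic regression.
    p = sigmoid(X @ theta)
    return X.T @ (p - y) / len(y)

# Hypothetical stand-ins for the poisoned model and the data splits.
d = 5
theta_star = rng.normal(size=d)                       # weights of the poisoned model
X_clean = rng.normal(size=(100, d))                   # clean (retain) data
y_clean = rng.integers(0, 2, size=100).astype(float)
X_poison = rng.normal(size=(20, d))                   # trojan-triggering samples to forget
y_poison = np.ones(20)                                # the attacker's target label

# Diagonal Fisher information estimated on clean data:
# mean of squared per-example gradients of the log-likelihood.
p_clean = sigmoid(X_clean @ theta_star)
per_example_grads = X_clean * (p_clean - y_clean)[:, None]
fisher = (per_example_grads ** 2).mean(axis=0)

def unlearn_step(theta, lr=0.1, lam=10.0):
    # Minimize  J(theta) = -L_poison(theta) + (lam/2) * sum_i F_i (theta_i - theta*_i)^2 :
    # ascend the loss on poison data while the EWC term anchors weights
    # that matter for the clean task (large Fisher values).
    g_ascent = -loss_grad(theta, X_poison, y_poison)   # gradient ascent direction
    g_penalty = lam * fisher * (theta - theta_star)    # EWC pull toward theta*
    return theta - lr * (g_ascent + g_penalty)

theta = theta_star.copy()
for _ in range(50):
    theta = unlearn_step(theta)
```

After these steps, the loss on the poison set rises (the trigger association degrades) while the Fisher-weighted penalty keeps the weights close to `theta_star` along directions the clean task depends on; the `lam` knob trades forgetting strength against preservation of original functionality.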