Machine unlearning, an emerging area of artificial intelligence, addresses the challenge of selectively forgetting or reducing undesirable knowledge or behaviors in machine learning models, particularly large language models (LLMs). This paper introduces a methodology for aligning LLMs, such as the Open Pre-trained Transformer (OPT) language models, with ethical, privacy, and safety standards by using gradient ascent for knowledge unlearning. The approach is dual-pronged: it selectively erases or modifies learned information in LLMs, targeting harmful responses on the one hand and copyrighted content on the other. To mitigate harmful responses, we applied gradient ascent on the PKU dataset, achieving a 75\% reduction in harmful responses for OPT1.3b and OPT2.7b \citet{zhang2022opt}, while using the TruthfulQA dataset \citet{DBLP:journals/corr/abs-2109-07958} to retain previously learned knowledge. To handle copyrighted content, we constructed a custom dataset based on the Lord of the Rings corpus and first fine-tuned the same models (OPT1.3b and OPT2.7b) on it with LoRA (Low-Rank Adaptation of Large Language Models) \citet{DBLP:journals/corr/abs-2106-09685}. We then employed gradient ascent to unlearn the Lord of the Rings content, which substantially reduced the presence of copyrighted material, and used the Book Corpus dataset to maintain a diverse knowledge base. Additionally, we propose a new evaluation technique for assessing the effectiveness of unlearning harmful responses.
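The core idea of gradient-ascent unlearning with a retain set can be illustrated on a toy model. The sketch below is not the authors' implementation: it assumes a two-feature logistic model in place of an LLM, and the names `unlearn`, `forget`, `retain`, and `retain_weight` are hypothetical. It also assumes that the retain data (TruthfulQA or Book Corpus in the abstract) enters as an ordinary gradient-descent term, which the abstract does not specify.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(w, x, y):
    """Negative log-likelihood of label y (0 or 1) under a logistic model."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -math.log(p if y == 1 else 1.0 - p)

def nll_grad(w, x, y):
    """Gradient of the negative log-likelihood with respect to w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]

def mean_loss(w, data):
    return sum(nll(w, x, y) for x, y in data) / len(data)

def mean_grad(w, data):
    grads = [nll_grad(w, x, y) for x, y in data]
    return [sum(g[i] for g in grads) / len(grads) for i in range(len(w))]

def unlearn(w, forget, retain, lr=0.5, steps=20, retain_weight=1.0):
    """Ascend the loss on the forget set, descend it on the retain set.

    Assumption: both terms are combined in a single update. The number of
    steps is kept small because the ascent term is unbounded.
    """
    w = list(w)
    for _ in range(steps):
        g_forget = mean_grad(w, forget)
        g_retain = mean_grad(w, retain)
        w = [
            wi + lr * gf - lr * retain_weight * gr  # +gf: ascent, -gr: descent
            for wi, gf, gr in zip(w, g_forget, g_retain)
        ]
    return w

# Hypothetical toy data: one example to forget, one to retain.
forget = [([1.0, 0.0], 1)]
retain = [([0.0, 1.0], 1)]
w_before = [2.0, 2.0]  # a model that has "learned" both examples
w_after = unlearn(w_before, forget, retain)

print("forget loss:", mean_loss(w_before, forget), "->", mean_loss(w_after, forget))
print("retain loss:", mean_loss(w_before, retain), "->", mean_loss(w_after, retain))
```

In this toy setting the two examples use separate features, so the forget loss rises while the retain loss does not get worse. In an LLM the forget and retain data share parameters, so the balance between the two terms (here `retain_weight`) and the number of ascent steps would need to be tuned.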