Unlearning in Multimodal Large Language Models (MLLMs) prevents the model from revealing private information when queried about target images. Existing MLLM unlearning methods largely adopt approaches developed for LLMs: they treat all answer tokens uniformly, disregarding their varying importance in the unlearning process. Moreover, these methods focus exclusively on the language modality, ignoring visual cues that indicate key tokens in answers. In this paper, after formulating the problem of unlearning in multimodal question answering for MLLMs, we propose Visual-Guided Key-Token Regularization (ViKeR). We leverage irrelevant visual inputs to predict ideal post-unlearning token-level distributions and use these distributions to regularize the unlearning process, thereby prioritizing key tokens. Further, we define key tokens in unlearning via information entropy and explain ViKeR's effectiveness through token-level gradient reweighting, which amplifies updates on key tokens. Experiments on the MLLMU and CLEAR benchmarks demonstrate that our method performs unlearning effectively while mitigating unintended forgetting of retained knowledge and maintaining response coherence.
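As a rough illustration of the regularization idea described above, the sketch below computes a token-weighted divergence between the model's current answer distributions (conditioned on the target image) and reference distributions obtained with an irrelevant image. The weighting scheme here (using the entropy of the reference distribution as each token's weight) and the function names `viker_loss`, `softmax`, `entropy`, and `kl_div` are illustrative assumptions, not the paper's actual formulation; this is a minimal toy sketch operating on plain lists of logits rather than a real MLLM.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl_div(p, q):
    """KL divergence KL(p || q) between two distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def viker_loss(target_logits, irrelevant_logits):
    """Hypothetical sketch of a ViKeR-style regularizer.

    target_logits: per-token vocab logits when conditioned on the target image.
    irrelevant_logits: per-token logits when conditioned on an irrelevant image,
        treated as the ideal post-unlearning distribution for that token.
    Each token's KL term is weighted by the entropy of its reference
    distribution (an assumed proxy for "key token"-ness, not the paper's
    exact definition).
    """
    total, weight_sum = 0.0, 0.0
    for tl, il in zip(target_logits, irrelevant_logits):
        p_cur = softmax(tl)   # current model distribution on the target image
        p_ref = softmax(il)   # reference distribution from the irrelevant image
        w = entropy(p_ref)    # assumed entropy-based key-token weight
        total += w * kl_div(p_cur, p_ref)
        weight_sum += w
    return total / max(weight_sum, 1e-12)
```

Because the weighted divergence enters the training loss, tokens with larger weights receive proportionally larger gradient updates, which is one way to read the token-level gradient reweighting interpretation in the abstract.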