Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretraining, which makes machine unlearning (MU) crucial. Existing methods typically evaluate unlearning effectiveness by output deviation alone and overlook generation quality after unlearning. As a result, unlearned models can produce hallucinated or rigid responses, which harms their usability and safety. To address this issue, we propose ASRU, a controllable multimodal unlearning framework that treats generation quality as a core evaluation objective. ASRU first induces initial refusal behavior through activation redirection and then refines fine-grained refusal boundaries with a customized reward function, achieving a better trade-off between unlearning the target knowledge and preserving model utility. Experiments on Qwen3-VL show that, using only a small amount of retained supervision data, ASRU improves unlearning effectiveness by 24.6% and generation quality by 5.8× on average while effectively preserving model utility.