Multimodal large language models (MLLMs) have attracted increasing attention in the past few years, but they may still generate descriptions that include objects not present in the corresponding images, a phenomenon known as object hallucination. To eliminate hallucinations, existing methods manually annotate paired responses with and without hallucinations and then apply various alignment algorithms to improve the alignment between images and text. However, these methods not only demand considerable computational resources during the fine-tuning stage but also require expensive human annotation to construct the paired data needed by the alignment algorithms. To address these issues, we borrow the idea of unlearning and propose an efficient fine-grained unlearning framework (EFUF), which can eliminate hallucinations without the need for paired data. Extensive experiments show that our method consistently reduces hallucinations while preserving generation quality with modest computational overhead. Our code and datasets will be publicly available.