Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) are susceptible to hallucinations, in particular assertively fabricating content absent from the visual inputs. To address this challenge, we draw on a common cognitive process: when one's initial memory of critical on-sight details fades, it is intuitive to look at them a second time in search of a factual and accurate answer. We therefore introduce Memory-space Visual Retracing (MemVR), a novel hallucination-mitigation paradigm that requires neither external knowledge retrieval nor additional fine-tuning. In particular, when the model is uncertain about, or even amnesic of, question-relevant visual memories, MemVR treats visual prompts as supplementary evidence and reinjects them into the MLLM through the Feed-Forward Network (FFN) as key-value memory. Comprehensive experimental evaluations demonstrate that MemVR significantly mitigates hallucinations across various MLLMs and excels on general benchmarks without incurring additional time overhead, underscoring its potential for broad applicability.
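The mechanism sketched in the abstract can be illustrated with a minimal numpy toy, assuming a standard two-layer FFN viewed as key-value memory and an entropy-based uncertainty gate; the function name, the gating rule, the mixing weight `alpha`, and all shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ffn_with_visual_retracing(hidden, W_up, W_down, vis_keys, vis_values,
                              logits, entropy_threshold=2.5, alpha=0.1):
    # Standard two-layer FFN viewed as key-value memory:
    # rows of W_up act as keys, rows of W_down as values.
    act = np.maximum(hidden @ W_up, 0.0)   # (seq, d_ff) memory coefficients
    out = act @ W_down                     # (seq, d_model)

    # Hypothetical uncertainty gate: mean entropy of the output distribution.
    probs = softmax(logits)
    entropy = -(probs * np.log(np.clip(probs, 1e-9, None))).sum(-1)
    if entropy.mean() > entropy_threshold:
        # "Look a second time": reinject visual tokens as extra
        # key-value memory slots inside the FFN pass.
        vis_act = softmax(hidden @ vis_keys.T)   # attend over visual keys
        out = out + alpha * (vis_act @ vis_values)
    return out
```

The gate fires only when the model's output distribution is diffuse, so the ordinary FFN computation is untouched on confident predictions; this matches the abstract's claim of no added overhead in the common case.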