Large Multimodal Models (LMMs) often suffer from multimodal hallucinations, in which they generate content that is not present in the visual input. In this paper, we explore a new angle on this issue: overly detailed training data hinders the model's ability to terminate generation in a timely manner, so that outputs continue beyond the limits of its visual perception. By investigating how the model decides to terminate generation with EOS, the special end-of-sentence token, we find that the model assesses the completeness of the entire sequence by comparing the generated text with the image. This observation suggests that the model has an inherent potential to make proper EOS decisions based on its visual perception and thus avoid overly lengthy outputs. To exploit this potential, we explore two methods for mitigating multimodal hallucinations: a training objective that enables the model to reduce hallucinations by learning from regular instruction data, and a data filtering strategy that prevents harmful training data from exacerbating model hallucinations. Both methods significantly reduce hallucinations in LMMs without requiring any additional data or knowledge.