Hallucination, where generated content is partially inconsistent with the image, is a common and hard-to-eradicate problem for Large Vision-Language Models (LVLMs), especially in long generations. To mitigate hallucination, existing studies intervene either in the model's inference process or in its generated outputs, but their solutions often fail to handle the variety of query types and the hallucinations that arise in answers to those queries. To address diverse hallucinations accurately, we present Dentist, a unified framework for hallucination mitigation. Its core step is to first classify the query and then apply a mitigation procedure tailored to the classification result, just as a dentist first examines the teeth and then makes a treatment plan. In a simple deployment, Dentist classifies queries as perception or reasoning and effectively mitigates potential hallucinations in the answers, as demonstrated in our experiments. On MMBench, we achieve accuracy improvements of 13.44%/10.2%/15.8% over the InstructBLIP/LLaVA/VisualGLM baselines on Image Quality, a Coarse Perception visual question answering (VQA) task.
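The classify-then-treat pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the classifier and both mitigation routines are hypothetical stubs (in Dentist these would be LVLM-based components), and the rule-based keyword classifier is purely for demonstration.

```python
def classify_query(query: str) -> str:
    """Hypothetical stand-in for the query classifier: label a query
    as 'perception' or 'reasoning' using simple keyword cues."""
    reasoning_cues = ("why", "how", "explain", "because")
    if any(cue in query.lower() for cue in reasoning_cues):
        return "reasoning"
    return "perception"

def mitigate_perception(answer: str) -> str:
    # Placeholder: e.g., verify mentioned objects/attributes against the image.
    return f"[perception-checked] {answer}"

def mitigate_reasoning(answer: str) -> str:
    # Placeholder: e.g., validate each reasoning step with sub-questions.
    return f"[reasoning-checked] {answer}"

def dentist_pipeline(query: str, draft_answer: str) -> str:
    """Route the draft answer through the mitigation branch selected by
    the query's class, mirroring the dentist analogy: diagnose first,
    then apply the matching treatment."""
    if classify_query(query) == "perception":
        return mitigate_perception(draft_answer)
    return mitigate_reasoning(draft_answer)
```

The point of the sketch is the routing structure: different query classes trigger different verification strategies, rather than one uniform mitigation pass for all generations.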