Large Vision Language Models (LVLMs) have shown remarkable capabilities in multimodal tasks such as visual question answering and image captioning. However, inconsistencies between the visual information and the generated text, a phenomenon referred to as hallucination, remain an unsolved problem for the trustworthiness of LVLMs. To address this problem, recent works have proposed incorporating computationally costly Large (Vision) Language Models to detect hallucinations at the sentence or sub-sentence level. In this work, we introduce MetaToken, a lightweight binary classifier that detects hallucinations at the token level at negligible cost. Based on a statistical analysis, we reveal key factors of hallucinations in LVLMs that have been overlooked in previous works. MetaToken can be applied to any open-source LVLM without any knowledge of ground-truth data, providing reliable hallucination detection. We evaluate our method on four state-of-the-art LVLMs, demonstrating the effectiveness of our approach.