Mitigating hallucinations in large vision-language models (LVLMs) remains an open problem. Recent benchmarks do not address hallucinations in open-ended free-form responses, which we term "Type I hallucinations". Instead, they focus on hallucinations in responses to very specific question formats -- typically a multiple-choice response regarding a particular object or attribute -- which we term "Type II hallucinations". Additionally, such benchmarks often require external API calls to models that are subject to change. In practice, we observe that a reduction in Type II hallucinations does not lead to a reduction in Type I hallucinations; rather, the two forms of hallucination are often anti-correlated. To address this, we propose THRONE, a novel object-based automatic framework for quantitatively evaluating Type I hallucinations in LVLM free-form outputs. We use public language models (LMs) to identify hallucinations in LVLM responses and compute informative metrics. By evaluating a large selection of recent LVLMs using public datasets, we show that improvements in existing metrics do not lead to a reduction in Type I hallucinations, and that established benchmarks for measuring Type I hallucinations are incomplete. Finally, we provide a simple and effective data augmentation method that reduces both Type I and Type II hallucinations, as a strong baseline.