We introduce VIGIL (Visual Inconsistency & Generative In-context Lucidity), the first benchmark dataset and framework providing a fine-grained categorization of hallucinations in the multimodal image recontextualization task for large multimodal models (LMMs). While existing research often treats hallucinations as a uniform issue, our work addresses a significant gap in multimodal evaluation by decomposing these errors into five categories: pasted object hallucinations, background hallucinations, object omission, positional & logical inconsistencies, and physical law violations. To address these complexities, we propose a multi-stage detection pipeline. Our architecture processes recontextualized images through a series of specialized steps targeting object-level fidelity, background consistency, and omission detection, leveraging a coordinated ensemble of open-source models; extensive experimental evaluations demonstrate its effectiveness. Our approach pairs each detected failure with an explanation, enabling a deeper understanding of where models fail; no prior method offers such categorization and decomposition for this task. To promote transparency and further exploration, we openly release VIGIL along with the detection pipeline and benchmark code through our GitHub repository (https://github.com/mlubneuskaya/vigil) and data repository (https://huggingface.co/datasets/joannaww/VIGIL).
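To make the structure of the abstract's pipeline concrete, the following is a minimal sketch, assuming a staged design in which each specialized detector emits categorized, explained findings. All names here (`HallucinationType`, `Finding`, `run_pipeline`, and the stage callables) are hypothetical illustrations, not VIGIL's actual API; see the GitHub repository for the real implementation.

```python
# Hypothetical sketch of a multi-stage hallucination-detection pipeline
# in the spirit described above. Names are illustrative, not VIGIL's API.
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List

class HallucinationType(Enum):
    # The five categories defined by VIGIL.
    PASTED_OBJECT = "pasted object hallucination"
    BACKGROUND = "background hallucination"
    OBJECT_OMISSION = "object omission"
    POSITIONAL_LOGICAL = "positional & logical inconsistency"
    PHYSICAL_LAW = "physical law violation"

@dataclass
class Finding:
    category: HallucinationType
    explanation: str  # each detected failure is paired with an explanation

# A stage inspects the source and recontextualized images and returns findings;
# stages would wrap open-source models (e.g., for object-level fidelity,
# background consistency, or omission detection).
Stage = Callable[[object, object], List[Finding]]

def run_pipeline(source_img, recontextualized_img,
                 stages: List[Stage]) -> List[Finding]:
    """Run each specialized detection stage in turn and collect findings."""
    findings: List[Finding] = []
    for stage in stages:
        findings.extend(stage(source_img, recontextualized_img))
    return findings
```

The design choice this sketch illustrates is that categorization is a property of the stage, not a post-hoc label: each stage targets one failure mode, so the pipeline's output is decomposed by construction.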