Despite recent advances in multimodal pre-training for visual description, state-of-the-art models still produce captions containing errors, such as hallucinating objects not present in a scene. The existing prominent metric for object hallucination, CHAIR, is limited to a fixed set of MS COCO objects and synonyms. In this work, we propose a modernized open-vocabulary metric, ALOHa, which leverages large language models (LLMs) to measure object hallucinations. Specifically, we use an LLM to extract groundable objects from a candidate caption, measure their semantic similarity to reference objects from captions and object detections, and use Hungarian matching to produce a final hallucination score. We show that ALOHa correctly identifies 13.6% more hallucinated objects than CHAIR on HAT, a new gold-standard subset of MS COCO Captions annotated for hallucinations, and 30.8% more on nocaps, where objects extend beyond MS COCO categories. Our code is available at https://davidmchan.github.io/aloha/.
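The matching step described above can be illustrated with a minimal sketch. It is not the released ALOHa implementation, and everything in it is an assumption made for illustration: the object lists are typed in by hand rather than extracted by an LLM, `TOY_SIMILARITY` is a hypothetical lookup table standing in for a learned semantic similarity, the optimal one-to-one assignment is found by exhaustive search rather than by the Hungarian algorithm (both return a maximum-similarity assignment, but exhaustive search is only practical for a handful of objects), and taking the minimum matched similarity as the caption-level score is a choice of this sketch.

```python
from itertools import permutations
from typing import Callable, List, Optional, Tuple

# Hypothetical similarity values; a real system would use embedding similarity.
TOY_SIMILARITY = {
    frozenset(("puppy", "dog")): 0.8,
    frozenset(("couch", "sofa")): 0.9,
}


def toy_similarity(a: str, b: str) -> float:
    """Return 1.0 for identical strings, a table value if listed, else 0.0."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)


Match = Tuple[str, Optional[str], float]


def match_objects(
    candidates: List[str],
    references: List[str],
    sim: Callable[[str, str], float] = toy_similarity,
) -> List[Match]:
    """Assign each candidate object to at most one reference object so that
    the total similarity is maximal (exhaustive search, small inputs only)."""
    # Pad with None so every candidate gets a slot even if references run out.
    padding = max(0, len(candidates) - len(references))
    padded: List[Optional[str]] = list(references) + [None] * padding

    def score(cand: str, ref: Optional[str]) -> float:
        return 0.0 if ref is None else sim(cand, ref)

    best_total = -1.0
    best_assignment: Tuple[Optional[str], ...] = ()
    for assignment in permutations(padded, len(candidates)):
        total = sum(score(c, r) for c, r in zip(candidates, assignment))
        if total > best_total:
            best_total, best_assignment = total, assignment

    return [(c, r, score(c, r)) for c, r in zip(candidates, best_assignment)]


def caption_score(matches: List[Match]) -> float:
    """Aggregate per-object scores with min (an assumption of this sketch).
    A caption with no extracted objects is scored 1.0 by convention here."""
    return min((s for _, _, s in matches), default=1.0)


# Hand-written example: "frisbee" has no similar reference object.
matches = match_objects(["puppy", "couch", "frisbee"], ["dog", "sofa", "grass"])
for cand, ref, s in matches:
    print(f"{cand!r} -> {ref!r}: {s:.1f}")
print("caption score:", caption_score(matches))
```

In this toy example, "frisbee" ends up paired with a reference it does not resemble, so its low matched similarity marks it as a likely hallucination and pulls the caption-level score down. For larger object sets, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` (called with `maximize=True` on the similarity matrix) could replace the exhaustive search.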