Recent advancements in large vision-language models (LVLMs) have led to significant progress in generating natural language descriptions for visual content, thereby enhancing various applications. One issue with these powerful models is that they sometimes produce text that is factually inconsistent with the visual input. While there has been some effort to mitigate such inconsistencies in natural image captioning, the factuality of generated captions for structured document images, such as charts, has received far less scrutiny, posing a potential threat to information reliability in critical applications. This work delves into the factuality aspect by introducing a comprehensive typology of factual errors in generated chart captions. A large-scale human annotation effort provides insight into the error patterns and frequencies in captions produced by various chart captioning models, ultimately forming the foundation of a novel dataset, CHOCOLATE. Our analysis reveals that even state-of-the-art models, including GPT-4V, frequently produce captions laced with factual inaccuracies. In response to this challenge, we establish the new task of Chart Caption Factual Error Correction and introduce CHARTVE, a visual entailment model that outperforms proprietary and open-source LVLMs in evaluating factual consistency. Furthermore, we propose C2TFEC, an interpretable two-stage framework that excels at correcting factual errors. This work inaugurates a new domain in factual error correction for chart captions, presents a novel evaluation mechanism, and demonstrates an effective approach to ensuring the factuality of generated chart captions. The code and data, as well as the continuously updated benchmark, can be found at: https://khuangaf.github.io/CHOCOLATE/.