Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning

Recent advancements in large vision-language models (LVLMs) have led to significant progress in generating natural language descriptions for visual content and thus enhancing various applications. One issue with these powerful models is that they sometimes produce texts that are factually inconsistent with the visual input. While there has been some effort to mitigate such inconsistencies in natural image captioning, the factuality of generated captions for structured document images, such as charts, has not received as much scrutiny, posing a potential threat to information reliability in critical applications. This work delves into the factuality aspect by introducing a comprehensive typology of factual errors in generated chart captions. A large-scale human annotation effort provides insight into the error patterns and frequencies in captions crafted by various chart captioning models, ultimately forming the foundation of a novel dataset, CHOCOLATE. Our analysis reveals that even state-of-the-art models, including GPT-4V, frequently produce captions laced with factual inaccuracies. In response to this challenge, we establish the new task of Chart Caption Factual Error Correction and introduce CHARTVE, a model for visual entailment that outperforms proprietary and open-source LVLMs in evaluating factual consistency. Furthermore, we propose C2TFEC, an interpretable two-stage framework that excels at correcting factual errors. This work inaugurates a new domain in factual error correction for chart captions, presenting a novel evaluation mechanism, and demonstrating an effective approach to ensuring the factuality of generated chart captions.

