Vision-Language Models (VLMs) have made remarkable progress in chart understanding, driven largely by supervised fine-tuning (SFT) on increasingly large synthetic datasets. However, scaling SFT data alone is inefficient and overlooks a key property of charts: they are programmatically generated visual artifacts in which small, code-controlled visual changes can drastically shift the semantics and the correct answer. Learning this counterfactual sensitivity requires VLMs to discriminate fine-grained visual differences, yet standard SFT treats training instances independently and provides limited supervision to enforce this behavior. To address this, we introduce ChartCF, a data-efficient training framework designed to enhance counterfactual sensitivity. ChartCF consists of (1) a counterfactual data synthesis pipeline based on code modification, (2) a chart similarity-based data selection strategy that filters out overly difficult samples to improve training efficiency, and (3) multimodal preference optimization across both the textual and visual modalities. Experiments on five benchmarks show that ChartCF matches or outperforms strong chart-specific VLMs while using significantly less training data.
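To make the three components concrete, the following is a minimal, illustrative sketch and not the ChartCF implementation. It assumes a toy `ChartSpec` that stands in for the parameters of a chart's plotting code, a single fixed question ("which category has the largest value?"), and a data-level similarity score as a stand-in for chart similarity. All names (`ChartSpec`, `make_counterfactual`, `similarity`, `build_pair`, `max_similarity`) are hypothetical, and treating very high similarity as "overly difficult" is an assumption made here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChartSpec:
    """Hypothetical stand-in for the parameters of a chart's plotting code."""
    labels: tuple
    values: tuple


def answer_argmax(spec):
    """Answer to the fixed question: which category has the largest value?"""
    best = max(range(len(spec.values)), key=spec.values.__getitem__)
    return spec.labels[best]


def make_counterfactual(spec, index, new_value):
    """Apply one small code-level edit: change a single data value."""
    values = list(spec.values)
    values[index] = new_value
    return ChartSpec(spec.labels, tuple(values))


def similarity(a, b):
    """Toy proxy for chart similarity in [0, 1]; 1.0 means identical data."""
    scale = max(max(a.values), max(b.values))
    diff = sum(abs(x - y) for x, y in zip(a.values, b.values))
    return 1.0 - diff / (scale * len(a.values))


def build_pair(spec, index, new_value, max_similarity=0.95):
    """Return text and visual preference pairs, or None if the edit is unusable."""
    cf = make_counterfactual(spec, index, new_value)
    original_answer = answer_argmax(spec)
    cf_answer = answer_argmax(cf)
    if cf_answer == original_answer:
        return None  # the edit did not change the correct answer
    if similarity(spec, cf) > max_similarity:
        return None  # assumption: near-identical charts are overly difficult
    return {
        # Textual preference: same chart, correct answer vs. counterfactual answer.
        "text_pref": {"chart": spec, "chosen": original_answer, "rejected": cf_answer},
        # Visual preference: same answer, matching chart vs. counterfactual chart.
        "visual_pref": {"answer": original_answer, "chosen": spec, "rejected": cf},
    }


base = ChartSpec(labels=("A", "B", "C"), values=(3, 7, 5))
pair = build_pair(base, index=2, new_value=9)
```

In this sketch, raising category `C` from 5 to 9 flips the answer from `B` to `C`, so the edit yields one textual and one visual preference pair. An edit that leaves the answer unchanged, or one whose similarity exceeds `max_similarity`, is dropped.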