While Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities for reasoning and self-correction at the textual level, these strengths offer little benefit for complex tasks centered on visual perception, such as chart parsing. Existing models often struggle with visually dense charts, producing errors such as data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a ``visual anchor'' to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors. We instantiate the VSR paradigm in the domain of chart parsing with ChartVSR, a model that decomposes the parsing process into two stages: a Refine Stage, in which it iteratively uses visual feedback to verify the pixel-level localization of every data point, and a Decode Stage, in which it uses these verified localizations as precise visual anchors to parse the final structured data. To address the limitations of existing benchmarks, we also construct ChartP-Bench, a new and highly challenging benchmark for chart parsing. More broadly, our work highlights VSR as a general-purpose visual feedback mechanism, offering a promising direction for improving accuracy on a wide range of vision-centric tasks.
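To make the paradigm concrete, the following Python-style sketch illustrates one way a VSR loop of the kind described above could be organized. Every identifier (e.g., \texttt{localize}, \texttt{render}, \texttt{inspect}, \texttt{correct}, \texttt{decode}) is a hypothetical placeholder introduced here for illustration only; it is not part of ChartVSR's actual implementation or released API.

\begin{verbatim}
# A minimal, hypothetical sketch of the Visual Self-Refine (VSR) loop.
# All calls on `model` are illustrative placeholders, not a released API.
def visual_self_refine(model, chart_image, max_rounds=3):
    """Refine Stage: localize -> visualize -> self-inspect; Decode Stage: parse."""
    anchors = model.localize(chart_image)              # pixel-level localizations
    for _ in range(max_rounds):
        overlay = model.render(chart_image, anchors)   # draw anchors onto the chart
        corrections = model.inspect(overlay)           # model critiques its own anchors
        if not corrections:                            # no remaining perception errors
            break
        anchors = model.correct(anchors, corrections)  # fix omissions/misalignments
    return model.decode(chart_image, anchors)          # parse data from verified anchors
\end{verbatim}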