Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision-and-language models, yet most prior work focuses on natural images. In contrast, existing benchmarks for chart referring expression grounding remain limited: (1) they largely adopt bounding boxes, which constrains localization precision for fine chart elements; (2) they mostly assume one or two referred target instances and therefore fail to handle multi-instance target references; (3) their language expressions over-rely on textual cues or data-rank cues; and (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues, and diverse chart types. Evaluations of representative multimodal large models on this benchmark reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. We train an instance segmentation model on the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a real-chart grounding benchmark derived from ChartQA.
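The idea of deriving masks from the plotting program can be illustrated with a small sketch. It is not the pipeline described above. It assumes Matplotlib as the plotting library (the abstract does not name one) and uses a simple difference-rendering trick chosen here for illustration: the chart is rendered once in full and once with a single bar hidden, and the pixels that change form that bar's visible instance mask. The function names `render` and `synthesize_bar_masks` are hypothetical.

```python
import numpy as np
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure


def render(canvas):
    """Rasterize the figure and return an (H, W, 4) uint8 RGBA array."""
    canvas.draw()
    return np.asarray(canvas.buffer_rgba()).copy()


def synthesize_bar_masks(values, figsize=(4, 3), dpi=100):
    """Return the rendered chart and one boolean mask per bar.

    Assumption: difference rendering stands in for whatever mechanism a
    real pipeline would use to map plotting primitives to pixels.
    """
    fig = Figure(figsize=figsize, dpi=dpi)
    canvas = FigureCanvasAgg(fig)
    ax = fig.add_subplot(111)
    bars = ax.bar(range(len(values)), values, color="tab:blue")

    # Pin the view limits so hiding an artist cannot trigger re-autoscaling.
    ax.set_xlim(ax.get_xlim())
    ax.set_ylim(ax.get_ylim())

    image = render(canvas)
    masks = []
    for bar in bars:
        # Hide only this bar, re-render, then restore it.
        bar.set_visible(False)
        without_bar = render(canvas)
        bar.set_visible(True)
        # Pixels that differ in any channel belong to the visible bar.
        masks.append(np.any(image != without_bar, axis=-1))
    return image, masks


image, masks = synthesize_bar_masks([1.0, 3.0, 2.0])
```

Because each mask comes from the renderer itself, it follows the anti-aliased outline of the element rather than a bounding box. Pixels of a bar that are covered by another artist (for example an axis spine) are excluded, so the masks describe visible regions only.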