Recent advances in Vision Language Models (VLMs) mark progress toward true intelligence, which requires robust reasoning capabilities. Beyond pattern recognition, linguistic reasoning must integrate with visual comprehension, particularly for Chart Question Answering (CQA) tasks involving complex data visualizations. Current VLMs face significant limitations in CQA, including imprecise numerical extraction, difficulty interpreting implicit visual relationships, and attention mechanisms that inadequately capture spatial relationships in charts. In this work, we address these challenges with Chart-RL, a novel reinforcement learning framework that enhances VLMs' chart understanding through feedback-driven policy optimization of visual perception and logical inference. Our key innovation is a comprehensive framework that integrates policy-optimization-based Reinforcement Learning (RL) with adaptive reward functions; it outperforms baseline foundation models and achieves competitive results against larger state-of-the-art architectures. We also integrate Parameter-Efficient Fine-Tuning through Low-Rank Adaptation (LoRA) into the RL framework, so that it requires only a single-GPU configuration while preserving performance. We conducted extensive benchmarking across open-source, proprietary, and state-of-the-art closed-source models on the ChartQAPro dataset. The RL fine-tuned Qwen3-VL-4B-Instruct model achieved an answer accuracy of 0.634, surpassing the 0.580 accuracy of the Qwen3-VL-8B-Instruct foundation model despite having half the parameter count, while also reducing inference latency from 31 seconds to 9 seconds.
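To illustrate why LoRA makes single-GPU fine-tuning plausible, the following minimal sketch compares the trainable parameters of one full weight matrix with those of its low-rank adapter. It uses the standard LoRA factorization, in which a d_out x d_in update is replaced by two factors of rank r. The layer size (4096 x 4096) and the rank (16) are hypothetical values chosen for illustration; they are not taken from Chart-RL.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates every entry of the d_out x d_in matrix.
    LoRA trains two factors instead: B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = rank * (d_in + d_out)
    return full, lora


# Hypothetical layer size and rank, for illustration only.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=16)
print(f"full: {full:,}  lora: {lora:,}  fraction: {lora / full:.4%}")
```

Under these assumed values the adapter trains under 1% of the layer's parameters, which is the kind of saving that shrinks optimizer and gradient memory.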
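The headline numbers can be put in relative terms with a short calculation. The sketch below uses only the figures reported above (accuracy 0.634 versus 0.580, latency 31 seconds versus 9 seconds). The function name and its return layout are illustrative assumptions and not part of Chart-RL.

```python
def summarize_gains(acc_tuned: float, acc_base: float,
                    latency_base_s: float, latency_tuned_s: float):
    """Return (absolute accuracy gain, relative accuracy gain, latency speedup)."""
    abs_gain = acc_tuned - acc_base
    rel_gain = abs_gain / acc_base
    speedup = latency_base_s / latency_tuned_s
    return abs_gain, rel_gain, speedup


# Figures as reported in the abstract.
abs_gain, rel_gain, speedup = summarize_gains(0.634, 0.580, 31.0, 9.0)
print(f"absolute gain: {abs_gain:.3f}")
print(f"relative gain: {rel_gain:.1%}")
print(f"latency speedup: {speedup:.2f}x")
```

This works out to an absolute gain of 0.054 accuracy points, roughly 9% relative to the 8B baseline, with inference about 3.4 times faster.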