Existing table understanding methods struggle with complex table structures and intricate logical reasoning. While supervised fine-tuning (SFT) dominates existing research, reinforcement learning (RL) methods such as Group Relative Policy Optimization (GRPO) have shown promise but suffer from low initial policy accuracy and coarse rewards in tabular contexts. In this paper, we introduce Table-R1, a three-stage RL framework that enhances multimodal table understanding through: (1) a warm-up stage that elicits initial perception and reasoning capabilities, (2) Perception Alignment GRPO (PA-GRPO), which employs continuous Tree-Edit-Distance Similarity (TEDS) rewards for recognizing table structures and contents, and (3) Hint-Completion GRPO (HC-GRPO), which applies fine-grained rewards to the residual reasoning steps of a hint-guided question. Extensive experiments demonstrate that Table-R1 substantially improves the model's table reasoning performance on both held-in and held-out datasets, outperforming SFT and GRPO by a large margin. Notably, Qwen2-VL-7B with Table-R1 surpasses larger table-specific understanding models (e.g., Table-LLaVA 13B) and even achieves performance comparable to the closed-source model GPT-4o on held-in datasets. These results demonstrate the efficacy of each stage of Table-R1 in overcoming initialization bottlenecks and reward sparsity, thereby advancing robust multimodal table understanding.
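To illustrate why coarse rewards are a bottleneck for GRPO when initial accuracy is low, the following minimal sketch compares group-relative advantages under a binary reward and under a continuous similarity reward. It assumes the commonly used GRPO normalization, (reward minus group mean) divided by group standard deviation; the abstract does not state the exact formula used in Table-R1. The function name `group_relative_advantages`, the `eps` term, the use of population standard deviation, and all reward values are hypothetical and chosen only for illustration; the continuous scores stand in for TEDS values and are not computed by a real TEDS implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its sampling group: (r - mean) / (std + eps).

    Assumption: population standard deviation and a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 sampled responses that are all wrong under
# exact-match scoring, so every binary reward is 0.
binary_rewards = [0.0, 0.0, 0.0, 0.0]

# Hypothetical similarity scores in [0, 1] for the same 4 responses
# (illustrative stand-ins for TEDS values, not real measurements).
continuous_rewards = [0.2, 0.5, 0.7, 0.4]

binary_adv = group_relative_advantages(binary_rewards)
continuous_adv = group_relative_advantages(continuous_rewards)

print("binary advantages:    ", binary_adv)
print("continuous advantages:", [round(a, 3) for a in continuous_adv])
```

With the binary reward, every advantage in the group is zero, so the group contributes no learning signal. With the continuous reward, the responses are still ranked against each other even though none is fully correct, which is the intuition behind using a graded score when the initial policy is weak.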