Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how can LVLMs be adapted to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to table structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLMs' table understanding and reasoning capabilities, and in particular generalizes to unseen table structures.