Modeling semantic and structural information from tabular data remains a core challenge for effective table understanding. Existing Table-as-Text approaches flatten tables for large language models (LLMs) but lose crucial structural cues, while Table-as-Image methods preserve structure yet struggle with precise semantics. Recent Table-as-Multimodality strategies attempt to combine textual and visual views, but they (1) statically process both modalities for every query-table pair within large multimodal LLMs (MLLMs), inevitably introducing redundancy and even conflicts, and (2) depend on costly fine-tuning of MLLMs. To address this, we propose TableDART, a training-efficient framework that integrates multimodal views by reusing pretrained single-modality models. TableDART introduces a lightweight 2.59M-parameter MLP gating network that dynamically selects the optimal path (Text-only, Image-only, or Fusion) for each table-query pair, thereby avoiding the redundancy and cross-view conflicts that arise when both modalities are processed statically. By routing each pair to its most appropriate view, our framework improves both accuracy and efficiency. In addition, we propose a novel agent that mediates cross-modal knowledge integration by analyzing the outputs of text- and image-based models, either selecting the better result or synthesizing a new answer through reasoning. This design avoids the prohibitive cost of full MLLM fine-tuning. Extensive experiments on seven benchmarks show that TableDART establishes new state-of-the-art performance among open-source models, surpassing the strongest baseline by an average of 4.02%. The code is available at: https://github.com/xiaobo-xing/TableDART.
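The dynamic routing described above can be sketched as a small MLP gate scoring three candidate paths for a joint query-table representation. This is a minimal illustrative sketch, not the paper's implementation: the embedding dimension, hidden width, feature construction, and initialization are all assumptions; only the three-way choice among Text-only, Image-only, and Fusion paths comes from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# The three candidate processing paths named in the abstract.
ROUTES = ["text_only", "image_only", "fusion"]

class GateMLP:
    """Hypothetical 2-layer MLP gate: embedding -> route probabilities."""

    def __init__(self, in_dim, hidden=512):
        # Randomly initialized weights; a real gate would be trained.
        self.W1 = rng.normal(scale=0.02, size=(in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=0.02, size=(hidden, len(ROUTES)))
        self.b2 = np.zeros(len(ROUTES))

    def forward(self, x):
        h = np.maximum(x @ self.W1 + self.b1, 0.0)  # ReLU hidden layer
        logits = h @ self.W2 + self.b2
        e = np.exp(logits - logits.max())           # numerically stable softmax
        return e / e.sum()

# Stand-in for a concatenated query-table embedding (dimension is assumed).
gate = GateMLP(in_dim=1024)
probs = gate.forward(rng.normal(size=1024))
route = ROUTES[int(np.argmax(probs))]  # hard selection of one path per pair
```

At inference, only the selected path's model(s) run, which is what yields the efficiency gain over always processing both modalities.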