Tabular data contains rich structural semantics and plays a crucial role in organizing and manipulating information. Recent methods employ Multi-modal Large Language Models (MLLMs) to address table-related tasks across different table representation modalities. However, existing studies mainly examine the table understanding ability of MLLMs with unimodal representations, leaving multi-modal representations, which could enable more effective table reasoning, underexplored. To better capture structural semantics from tabular data, this paper introduces the HybrId-modal Preference oPtimizatiOn (HIPPO) model, which represents tables as both text and images and optimizes MLLMs to learn more comprehensive table information from the two modalities. Specifically, HIPPO samples MLLM responses from hybrid-modal table representations and designs a modality-consistent sampling strategy to enhance response diversity and mitigate modality bias during Direct Preference Optimization (DPO) training. Experiments on table question answering and table fact verification tasks demonstrate the effectiveness of HIPPO, which achieves a 4% improvement over various table reasoning models. Further analysis reveals that HIPPO not only enhances table reasoning based on unimodal representations but also facilitates the extraction of complementary semantics across modalities. The code is available at https://github.com/NEUIR/HIPPO.
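To make the DPO step mentioned above more concrete, the following is a minimal sketch of the standard DPO objective applied to preference pairs built from responses sampled under different table modalities. It is not taken from the HIPPO repository: the `Response` class, the `build_pairs` helper, the modality labels, and the rule of pairing every correct response with every incorrect one are all illustrative assumptions, and the sketch does not reproduce HIPPO's modality-consistent sampling strategy. Only the loss formula, -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)), follows the standard DPO definition.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass
class Response:
    """One sampled answer (hypothetical container, not from the HIPPO code)."""
    modality: str       # e.g. "text", "image", "hybrid" (illustrative labels)
    answer: str         # final answer extracted from the model output
    policy_logp: float  # log-probability under the model being trained
    ref_logp: float     # log-probability under the frozen reference model


def build_pairs(responses, gold):
    """Assumed pairing rule: each correct response is preferred over each
    incorrect one, regardless of which modality produced it."""
    chosen = [r for r in responses if r.answer == gold]
    rejected = [r for r in responses if r.answer != gold]
    return list(product(chosen, rejected))


def dpo_loss(chosen, rejected, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair."""
    chosen_ratio = chosen.policy_logp - chosen.ref_logp
    rejected_ratio = rejected.policy_logp - rejected.ref_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Made-up log-probabilities, for illustration only.
    sampled = [
        Response("text", "42", policy_logp=-1.0, ref_logp=-1.0),
        Response("image", "41", policy_logp=-2.0, ref_logp=-2.0),
        Response("hybrid", "42", policy_logp=-0.5, ref_logp=-1.0),
    ]
    for good, bad in build_pairs(sampled, gold="42"):
        print(good.modality, ">", bad.modality, dpo_loss(good, bad))
```

Pooling responses from several modalities before pairing is one simple way to obtain more varied preference pairs; how pairs are actually selected to limit modality bias is specific to HIPPO and is described in the paper and repository rather than in this sketch.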