Images and structured tables are essential parts of real-world databases. Although tabular-image representation learning promises new insights, it remains challenging: tabular data are typically heterogeneous and incomplete, and they differ markedly from images as a modality. Earlier works have mainly focused on simple modality fusion strategies in complete-data scenarios and do not address missing data, which limits their use in practice. In this paper, we propose TIP, a novel tabular-image pre-training framework for learning multimodal representations that are robust to incomplete tabular data. Specifically, TIP introduces a self-supervised learning (SSL) strategy that combines a masked tabular reconstruction task, which tackles data missingness, with image-tabular matching and contrastive learning objectives, which capture multimodal information. Moreover, TIP proposes a versatile tabular encoder tailored for incomplete, heterogeneous tabular data and a multimodal interaction module for inter-modality representation learning. Experiments are performed on downstream multimodal classification tasks using both natural and medical image datasets. The results show that TIP outperforms state-of-the-art supervised and SSL algorithms, both image and multimodal, in complete and incomplete data scenarios alike. Our code is available at https://github.com/siyi-wind/TIP.
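To make two of the pre-training ingredients named in the abstract more concrete, the following is a minimal sketch of random masking of a tabular row (the input to a masked reconstruction task) and of a symmetric image-tabular contrastive loss. It is not TIP's implementation, which lives in the linked repository. The function names, the `MISSING` marker, the mask ratio, the temperature, and the InfoNCE-style form of the loss are all assumptions made for illustration.

```python
import math
import random

MISSING = None  # hypothetical marker for a masked or missing table cell


def mask_row(row, mask_ratio, rng):
    """Hide a random fraction of one tabular row (at least one cell).

    Returns the masked row and the sorted indices that a model would
    be asked to reconstruct.
    """
    n_mask = max(1, round(mask_ratio * len(row)))
    masked_idx = set(rng.sample(range(len(row)), n_mask))
    masked = [MISSING if i in masked_idx else v for i, v in enumerate(row)]
    return masked, sorted(masked_idx)


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(img_emb, tab_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    img_emb[i] and tab_emb[i] are assumed to come from the same sample;
    every other pairing in the batch is treated as a negative.
    """
    img = [l2_normalize(v) for v in img_emb]
    tab = [l2_normalize(v) for v in tab_emb]
    # Cosine similarity of every image with every table, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in tab]
              for i in img]
    logits_t = [list(col) for col in zip(*logits)]  # table-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


if __name__ == "__main__":
    rng = random.Random(0)
    masked, idx = mask_row([0.2, 1.5, 3.0, 7.0, 0.0], mask_ratio=0.4, rng=rng)
    print(masked, idx)
    print(contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

In this sketch the loss is small when each image embedding is closest to its own table embedding and large when the pairs are swapped, which is the behaviour a contrastive objective is meant to reward. A full pipeline would add an encoder for each modality, a reconstruction loss on the masked cells, and a matching head, none of which are shown here.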