Multimodal large-scale pretraining has shown impressive performance on unstructured data, including language, image, audio, and video. However, a prevalent real-world scenario, the combination of structured data types (tabular, time-series) with unstructured data, has so far been understudied. To bridge this gap, we propose LANISTR, an attention-based framework to learn from LANguage, Image, and STRuctured data. The core of LANISTR's methodology is rooted in \textit{masking-based} training applied at both the unimodal and multimodal levels. In particular, we introduce a new similarity-based multimodal masking loss that enables LANISTR to learn cross-modal relations from large-scale multimodal data with missing modalities. On two real-world datasets, MIMIC-IV (healthcare) and Amazon Product Review (retail), LANISTR achieves absolute improvements of 6.6\% (AUROC) and up to 14\% (accuracy) when fine-tuned on 0.1\% and 0.01\% of the labeled data, respectively, compared to state-of-the-art alternatives. Notably, these improvements are observed even in the presence of considerable missingness ratios of 35.7\% and 99.8\% in the respective datasets.