Training multimodal foundation models is challenging due to the limited availability of multimodal datasets. While many public datasets pair images with text, few combine images with audio or text with audio. Even rarer are datasets that align all three modalities at once. Critical domains such as healthcare, infrastructure, or transportation are particularly affected by missing modalities. This makes it difficult to integrate all modalities into a large pre-trained neural network that can be used out-of-the-box or fine-tuned for different downstream tasks. We introduce LoReTTa (Linking mOdalities with a tRansitive and commutativE pre-Training sTrAtegy) to address this understudied problem. Our self-supervised framework unifies causal modeling and masked modeling with the rules of commutativity and transitivity. This allows us to transition within and between modalities. As a result, our pre-trained models are better at exploring the true underlying joint probability distribution. Given a dataset containing only the disjoint combinations (A, B) and (B, C), LoReTTa can model the relation A <-> C through the chain A <-> B <-> C. In particular, we show that a transformer pre-trained with LoReTTa can handle any mixture of modalities at inference time, including the never-seen pair (A, C) and the triplet (A, B, C). We extensively evaluate our approach on synthetic, medical, and reinforcement learning datasets. Across these domains, our universal multimodal transformer consistently outperforms strong baselines such as GPT, BERT, and CLIP on tasks involving the missing modality tuple.
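To make the commutativity and transitivity rules concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each sample is a mapping from modality name to a token sequence. The helper names (`commutative_orderings`, `transitive_example`, `toy_generate`) are hypothetical, and `toy_generate` is a stand-in for sampling from the pre-trained model. Commutativity is shown as training on every ordering of the available modalities, and transitivity as generating the missing modality C from the bridge modality B, then using that pseudo sample as context for reconstructing the real A.

```python
from itertools import permutations
from typing import Callable, Dict, List, Tuple

Tokens = List[int]
Sample = Dict[str, Tokens]  # modality name -> token sequence
# (source modality, source tokens, target modality) -> generated tokens
Generator = Callable[[str, Tokens, str], Tokens]


def commutative_orderings(sample: Sample) -> List[List[Tuple[str, Tokens]]]:
    """Return every ordering of the modalities present in one aligned sample."""
    return [
        [(name, sample[name]) for name in order]
        for order in permutations(sorted(sample))
    ]


def transitive_example(
    sample: Sample, bridge: str, missing: str, target: str, generate: Generator
) -> Dict[str, Tuple[str, Tokens]]:
    """Build a pseudo pair (missing -> target) from a sample lacking `missing`.

    The missing modality is generated from the bridge modality, and the real
    target modality serves as the reconstruction target.
    """
    pseudo = generate(bridge, sample[bridge], missing)
    return {"context": (missing, pseudo), "target": (target, sample[target])}


def toy_generate(source: str, tokens: Tokens, target: str) -> Tokens:
    """Hypothetical placeholder for model sampling: reverses the tokens."""
    return list(reversed(tokens))


# A sample from the (A, B) split; modality C is never observed with A.
ab_sample: Sample = {"A": [1, 2, 3], "B": [4, 5, 6]}

orderings = commutative_orderings(ab_sample)
example = transitive_example(
    ab_sample, bridge="B", missing="C", target="A", generate=toy_generate
)

print(orderings)
print(example)
```

In this sketch the pair (C, A) never appears in the data, yet the constructed example gives the model a training signal linking the two through B. A real setup would replace `toy_generate` with autoregressive sampling from the transformer being trained.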