Despite the recent success of Multimodal Large Language Models (MLLMs), existing approaches predominantly assume that all modalities are available during both training and inference. In practice, multimodal data is often incomplete: modalities may be missing, collected asynchronously, or available only for a subset of examples. In this work, we propose PRIMO, a supervised latent-variable imputation model that quantifies the predictive impact of a missing modality in multimodal learning. PRIMO can use all available training examples, whether their modalities are complete or partial. Specifically, it models the missing modality through a latent variable that captures its relationship with the observed modality in the context of prediction. During inference, we draw many samples from the learned distribution over the missing modality, both to obtain the marginal predictive distribution used for prediction and to analyze the impact of the missing modality on the prediction for each instance. We evaluate PRIMO on a synthetic XOR dataset, Audio-Vision MNIST, and MIMIC-III for mortality and ICD-9 prediction. Across all datasets, PRIMO obtains performance comparable to unimodal baselines when a modality is fully missing and to multimodal baselines when all modalities are available. PRIMO quantifies the predictive impact of a modality at the instance level using a variance-based metric computed from predictions across latent completions. We also show visually how different completions of the missing modality yield a set of plausible labels.
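The inference procedure described above can be illustrated with a minimal sketch. It is not PRIMO's implementation: the function names, the toy XOR predictor, the uniform sampling of the missing input, and the choice of total variance over class probabilities as the impact score are all assumptions made for illustration. The sketch averages predictions over sampled completions to approximate the marginal predictive distribution, and uses the spread of those predictions as an instance-level impact score.

```python
import numpy as np


def marginal_and_impact(sample_probs):
    """Summarize predictions made under sampled completions of a missing modality.

    sample_probs: array of shape (S, C), one predictive distribution over C
    classes for each of S sampled completions.
    Returns the Monte Carlo estimate of the marginal predictive distribution
    and an assumed variance-based impact score (variance across completions,
    summed over classes).
    """
    sample_probs = np.asarray(sample_probs, dtype=float)
    marginal = sample_probs.mean(axis=0)
    impact = sample_probs.var(axis=0).sum()
    return marginal, impact


def xor_predictor(x_obs, x_missing):
    """Hypothetical stand-in for a trained predictor: one-hot over y = x_obs XOR x_missing."""
    y = int(x_obs) ^ int(x_missing)
    return np.eye(2)[y]


# Assumption: the missing binary input is sampled uniformly; a learned model
# would instead sample from its distribution over the missing modality.
rng = np.random.default_rng(0)
completions = rng.integers(0, 2, size=1000)
probs = np.stack([xor_predictor(1, c) for c in completions])

marginal, impact = marginal_and_impact(probs)
print("marginal predictive:", marginal)
print("impact score:", impact)
```

In this toy XOR case the label depends entirely on the missing input, so the sampled predictions disagree and the score is large. If every completion led to the same prediction, the score would be zero, indicating that the missing modality has no effect on that instance.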