Multi-modal learning has emerged as an increasingly promising avenue in visual recognition, driving innovations across domains ranging from media and education to healthcare and transportation. Despite this success, the robustness of multi-modal learning for visual recognition is often challenged by the unavailability of a subset of modalities, especially the visual modality. Conventional approaches to mitigating missing modalities in multi-modal learning rely heavily on specialized algorithms and modality fusion schemes. In contrast, this paper explores the use of text-to-image models to assist multi-modal learning. Specifically, we propose a simple but effective multi-modal learning framework, GTI-MM, that enhances data efficiency and model robustness against missing visual modality by imputing the missing data with generative transformers. Using multiple multi-modal datasets with visual recognition tasks, we present a comprehensive analysis of diverse conditions of missing visual modality, covering both model training and testing. Our findings reveal that synthetic images improve training data efficiency when visual data is missing during training, and improve model robustness when visual data is missing in both training and testing. Moreover, we demonstrate that GTI-MM remains effective with a low quantity of generated images and simple prompt techniques.
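To make the imputation idea concrete, the sketch below shows one plausible way to fill in missing visual data with an off-the-shelf text-to-image model. It assumes the Hugging Face `diffusers` library with a Stable Diffusion checkpoint as the generative model; the dataset structure and prompt template are illustrative stand-ins, not the authors' exact GTI-MM configuration.

```python
# Minimal sketch: impute missing images with a text-to-image model
# before multi-modal training. Assumes `diffusers` and a CUDA device;
# the prompt template and sample format are hypothetical.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def impute_missing_images(samples, prompt_template="a photo of {label}"):
    """Generate a synthetic image for every sample whose visual modality is missing.

    `samples` is a list of dicts with keys "text", "label", and "image"
    (None when the visual modality is unavailable).
    """
    for sample in samples:
        if sample.get("image") is None:  # visual modality missing
            prompt = prompt_template.format(label=sample["label"])
            sample["image"] = pipe(prompt).images[0]  # PIL.Image
    return samples

# The imputed dataset can then be fed to any standard multi-modal
# classifier, exactly as if the images had been observed.
```

A simple class-name prompt like the one above reflects the abstract's finding that elaborate prompt engineering is not required; generating only a few images per missing sample likewise suffices under the low-generation-quantity result.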