This paper addresses the challenge of modality mismatch in multimodal learning, where the modalities available at inference differ from those available during training. We propose Text-centric Alignment for Multi-Modality Learning (TAMML), a method that leverages Large Language Models (LLMs) with in-context learning, together with foundation models, to improve the generalizability of multimodal systems under such conditions. By exploiting the unique properties of text as a unified semantic space, TAMML achieves significant improvements in handling unseen, diverse, and unpredictable modality combinations. TAMML not only adapts to varying modalities but also maintains robust performance, demonstrating the potential of foundation models to overcome the limitations of traditional fixed-modality embedding frameworks. This study contributes a flexible and effective solution for real-world applications in which modality availability is dynamic and uncertain.