Multimodal learning combines multiple data modalities, broadening the types and complexity of data our models can use, for example from plain text to image-caption pairs. Most multimodal learning algorithms focus on modeling simple one-to-one pairs of data from two modalities, such as image-caption or audio-text pairs. However, in most real-world settings, entities of different modalities interact in more complex and multifaceted ways that go beyond one-to-one mappings. We propose to represent these complex relationships as graphs, which lets us capture data with any number of modalities and with relationships between modalities that can vary flexibly from one sample to another. Toward this goal, we propose Multimodal Graph Learning (MMGL), a general and systematic framework for capturing information from multiple multimodal neighbors with relational structures among them. In particular, we focus on MMGL for generative tasks, building on pretrained Language Models (LMs) and aiming to augment their text generation with multimodal neighbor contexts. We study three research questions raised by MMGL: (1) how can we infuse information from multiple neighbors into pretrained LMs while avoiding scalability issues? (2) how can we infuse the graph structure information among multimodal neighbors into the LMs? and (3) how can we finetune the pretrained LMs to learn from the neighbor context in a parameter-efficient manner? We conduct extensive experiments to answer these three questions and analyze the empirical results to pave the way for future MMGL research.